16S RefSeq records processing and curation
A curated collection of 16S ribosomal RNA (rRNA) sequences from bacterial and archaeal type materials was created with the goals of:
- Providing tools to better analyze and validate rRNA sequence data
- Maintaining up-to-date and complete taxonomic information for bacterial and archaeal type materials
- Improving genome annotation
- Allowing users to make identifications based on sequence data
- Allowing users to extract specific subsets of data
The collection was created by first extensively updating NCBI taxonomic resources to include the most up-to-date lists of published bacterial and archaeal names and their associated type materials. Corresponding GenBank submissions were updated to include the most recent taxonomic information. The NCBI Entrez search engine was then used to retrieve all relevant 16S ribosomal RNA sequences from the type materials of bacteria and archaea. The retrieved records were evaluated to ensure that they were taxonomically correct and free of annotation errors, publications were added or updated, and the sequences were validated by several techniques.
Sequence validation steps included:
- Trimming long sequences so that they contain only the 16S rRNA gene
- Removing or trimming low-quality sequences
- BLAST searches against nr and existing rRNA reference collections
- Vector screening and removal of terminal Ns
- Alignments
- Chimera checking
- Adding or correcting intron annotation
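Two of the steps above, removal of terminal Ns and rejection of low-quality sequences, can be illustrated with a minimal Python sketch. This is not the pipeline NCBI used; the function names and the thresholds (`min_length`, `max_ambiguous`) are hypothetical and chosen only for illustration.

```python
import re


def trim_terminal_ns(seq: str) -> str:
    """Strip runs of N (upper or lower case) from both ends of a sequence."""
    return re.sub(r"^[Nn]+|[Nn]+$", "", seq)


def ambiguous_fraction(seq: str) -> float:
    """Return the fraction of characters that are not A, C, G, T or U."""
    if not seq:
        return 1.0  # treat an empty sequence as entirely uninformative
    upper = seq.upper()
    unambiguous = sum(upper.count(base) for base in "ACGTU")
    return 1.0 - unambiguous / len(seq)


def passes_basic_checks(
    seq: str,
    min_length: int = 1000,       # hypothetical cutoff, not an NCBI value
    max_ambiguous: float = 0.01,  # hypothetical cutoff, not an NCBI value
) -> bool:
    """Trim terminal Ns, then apply a length and an ambiguity filter."""
    trimmed = trim_terminal_ns(seq)
    return len(trimmed) >= min_length and ambiguous_fraction(trimmed) <= max_ambiguous
```

Internal Ns are deliberately left in place by `trim_terminal_ns`; they count against the sequence through `ambiguous_fraction` instead.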
Sequences deemed to be of lower quality, or that did not pass the validation steps, were excluded. Where possible, near-full-length sequences were selected preferentially; in some cases, however, shorter sequences were included if no longer sequence was available. INSD records that passed the validation steps were then used to create reference sequence entries. Ongoing work will add to this set as more type strains are published.
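The preference for near-full-length sequences can be sketched as a simple selection rule. The example below is an illustration only: the `Candidate` record, the `select_preferred` function, the 1400-base cutoff, and the accession strings are all assumptions, not part of the process described here.

```python
from collections import defaultdict
from typing import NamedTuple


class Candidate(NamedTuple):
    accession: str  # placeholder identifier, not a real accession
    taxon: str      # name of the organism the sequence comes from
    length: int     # sequence length in bases after trimming


def select_preferred(candidates, near_full_length=1400):
    """Keep near-full-length sequences per taxon, else the longest one.

    The 1400-base cutoff is a hypothetical value for illustration.
    """
    by_taxon = defaultdict(list)
    for candidate in candidates:
        by_taxon[candidate.taxon].append(candidate)

    selected = {}
    for taxon, group in by_taxon.items():
        long_enough = [c for c in group if c.length >= near_full_length]
        # Fall back to the single longest sequence when none is near full length.
        selected[taxon] = long_enough or [max(group, key=lambda c: c.length)]
    return selected


candidates = [
    Candidate("ACC1", "Taxon A", 1510),
    Candidate("ACC2", "Taxon A", 900),
    Candidate("ACC3", "Taxon B", 750),
    Candidate("ACC4", "Taxon B", 1100),
]
preferred = select_preferred(candidates)
```

Here `Taxon A` keeps only its near-full-length record, while `Taxon B`, which has none, keeps its longest available sequence.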
16S RefSeq Nucleotide sequence records