Eukaryotic Genome Annotation Guide
Annotation
table2asn (the replacement of tbl2asn) use a simple five-column tab-delimited table of feature locations and qualifiers in order to generate annotation.
The format of this feature table allows diferent kinds of features (e.g. gene, coding region, tRNA, repeat_region) and qualifiers (e.g. /product, /note) to be indicated. The validator will check for errors such as internal stops in coding regions.
Guidelines for prokaryotic genome submissions .
If you do not understand any of the instructions presented here or you have questions, please contact us by email at genomes@ncbi.nlm.nih.gov prior to creating your submission. This will save us both a lot of time.
Table of Contents
- Prepare annotation table
- Gene features
- locus_tag
- protein_id
- transcript_id
- CDS (coding region) features
- Partial coding regions
- Gene fragments
- Transpliced Genes
- Split genes on two contigs
- Ribosomal RNA, tRNA and other RNA features
- mRNA features
- Alternatively spliced genes
- Evidence Qualifiers
- Data base cross references
- Gene Ontology
Prepare annotation table
The features must be in a simple five-column tab-delimited table, called the feature table. The feature table specifies the location and type of each feature for table2asn (previously tbl2asn) to include in the GenBank submission that is created. The first line of the table contains the following basic information:
>Features SeqID table_name
The SeqID must be the same as the sequence's SeqID in the FASTA file. The table_name is optional. Subsequent lines of the table list the features. Columns are separated by tabs.
- Column 1: Start location of feature
- Column 2: Stop location of feature
- Column 3: Feature key
- Column 4: Qualifier key
- Column 5: Qualifier value
Figure 2 shows a sample feature table and illustrates a number of points about the feature table format. The GenBank flatfile corresponding to this table is shown in Figure 3 . The allowed features and their qualifiers are listed in the Feature Table documentation .
Features that are on the complementary strand, such as the genes Ngs_3038 and Ngs_11232 and their corresponding features shown in Figure 2 , are indicated by reversing the interval locations.
Please avoid unnecessary capitalization in all text entered in your table.
Additional requirements, as well as suggestions for various types of annotation, are included in the following sections.
Gene features
Gene features are always a single interval, and their location should cover the intervals of all the relevant features such as promoters and polyA binding sites.
Gene names should follow the standard nomenclature rules of the particular organism. For example, mouse gene names begin with an uppercase letter, and the remaining letters are lowercase.
Coding regions (CDS) and RNAs, such as tRNAs and rRNAs, must have a corresponding gene feature. However, other features such as repeat_regions and misc_features do not have a corresponding gene or locus_tag.
locus_tag
All genes should be assigned a systematic gene identifier which should receive the locus_tag qualifier on the gene feature in the table. Genes may also have functional names as assigned in the scientific literature. In this example, KCS_0001 is the systematic gene identifier, while Abc5 is the functional gene name.
Table view of gene with both biological name and locus_tag:
1 1575 gene gene Abc5 locus_tag KCS_0001
Flatfile view:
gene 1..1575 /gene="Abc5" /locus_tag="KCS_0001"
Table view of gene with only locus_tag:
1 1575 gene locus_tag KCS_0001
Flatfile view:
gene 1..1575 /locus_tag="KCS_0001"
For consistency the same locus_tag prefix must be used throughout the entire genome. Therefore, all the chromosomes of a genome should have the same locus_tag prefix.
To improve the use of locus_tags we are now requiring that all locus_tag prefixes be registered and that they be unique. We recommend having the BioProject registration process auto-assign a locus_tag prefix, as they are not meant to convey meaning. The locus_tag prefix should be 3-12 alphanumeric characters and the first character may not be a digit. The locus_tag prefix is followed by an underscore and then an alphanumeric identification number that is unique within the given genome. Other than the single underscore used to separate the prefix from the identification number, no special characters can be used in the locus_tag.
The chromosome number can be embedded in the locus_tag, if desired, in the format Prefix_#g#####, where the first # is the chromosome number and ##### is the unique number of the gene. For example, Ajs_4g00123 for a gene on chromosome 4.
Read more about locus_tags and their intended usage.
Please register your genome project and proposed locus_tag prefix on the BioProject registration page prior to preparing your submission to GenBank. Each project that is registered here is assigned a project_id, and in the future we intend that the project_id will appear on all entries associated with a particular genome project.
protein_id
All proteins in a WGS or complete genome must be assigned an identification number by the submitter. We use this number to track proteins when sequences are updated. This number is indicated in the table by the CDS qualifier protein_id, and should have the format gnl|dbname|string, where dbname is a version of your lab name that you think will be unique (eg SmithUCSD), and string is the unique protein SeqID assigned by the submitter. This identifier is saved with the record (in ASN.1 format), but it is not visible in the flatfile. We recommend using the locus_tag as the protein SeqID. In this example, the protein_id for ABC5 is gnl|SmithUCSD|KCS_0001.
Example:
<1 >1575 gene
gene Abc5
locus_tag KCS_0001
pseudogene unprocessed
1 1575 CDS
product ABC5
protein_id gnl|SmithUCSD|KCS_0001
Since the protein_id is used for internal tracking in our database, it is important that the complete protein_id (dbname + SeqID) not be duplicated by a genome center. Thus, if your genome center is submitting more than one complete genome, please be sure to use unique protein_id's for all of the genomes.
The protein_id is also included as a qualifier on the corresponding mRNA feature, to allow the CDS and mRNA to be paired during processing.
Note that when WGS submissions are processed, the dbname in the protein_id is automatically changed to 'WGS:XXXX', where XXXX is the project's accession number prefix.
After your genome is released into GenBank, the proteins are assigned accession numbers. We will provide a table of the protein SeqIDs and accession numbers for you to use in future updates .
transcript_id
The transcript_id is included as a qualifier for both the CDS and its corresponding mRNA. It has the same format as the protein_id, gnl|dbname|identifier. Because each transcript_id and protein_id must be unique, we suggest adding 'mrna' or 't' to the protein_id identifier as a simple way to create the corresponding (unique) transcript_id. However, you can use whatever naming convention you choose, as long as all of the identifiers are unique.
63574 87173 gene
locus_tag Ngs_17131
63574 63907 mRNA
75690 75730
84396 85536
85598 85773
85836 86109
86173 86467
86555 86670
86731 87173
product hypothetical protein
protein_id gnl|ncbi|Ngs_17131
transcript_id gnl|ncbi|mrna.Ngs_17131
84402 85536 CDS
85598 85773
85836 86109
86173 86467
86555 86670
86731 86882
product hypothetical protein
protein_id gnl|ncbi|Ngs_17131
transcript_id gnl|ncbi|mrna.Ngs_17131
CDS (coding region) features
All CDS features must have a product qualifier (protein name). NCBI protein naming conventions are adopted from the International Protein Nomenclature Guidelines.
Consistent nomenclature is indispensable for communication, literature searching and data retrieval. Many species-specific communities have established gene nomenclature committees that try to assign consistent and, if possible, meaningful gene symbols. Other scientific communities have established protein nomenclatures for a set of proteins based on sequence similarity and/or function. But there is no established organization involved in the standardization of protein names, nor are there any efforts to establish naming rules that are valid across the largest spectrum of species possible.
Ambiguities regarding gene/protein names are a major problem in the literature and it is even worse in the sequence databases which tend to propagate the confusion. For this reason, we ask that you follow some basic guidelines in naming your proteins. The protein naming guidelines are based on the premise that a good and stable recommended name for a protein is a name that is as neutral as possible.
Guidelines for naming proteins:
- If it exists, use the approved nomenclature.
- Use a concise name, not a description or phrase.
- Ideally, the name should be unique and attributed to all orthologs.
- In cases where the protein name is not known use "hypothetical protein" or "uncharacterized protein" as the product name.
- The protein name should not reflect the the protein's subcellular location, its domain structure, its molecular weight or its species of origin. This information can be included in the note.
- For proteins that belong to a multigene family, it is recommended that you choose a coherent nomenclature with numbers to specify the different members of the family.
- When naming proteins which can be grouped into a family based on homology or according to a notion of shared function, the different members should be enumerated with a dash "-" followed by an Arabic number. e.g. "desmoglein-1", "desmoglein-2", etc.
- Proteins of unknown function which contain a defined domain or motif, can be named according to the domain present. The name should be of the following type: "<domain|repeat>-containing protein". e.g. "PAS domain-containing protein 5".
- Protein names may be denoted by the same symbol as the corresponding gene, but in the correct format for the organism. For example, mouse proteins have the same symbol as the gene name, but the protein name has all capital letters.
- Greek letters must be written in full e.g. "alpha", and written entirely in lower case with the exception of "Delta" in the context of steroid/fatty acid metabolism nomenclature. Additionally the Greek letters that are followed by a number should be preceded or followed by a dash "-" e.g. "unicornase alpha-1".
- Use lowercase letters, except when uppercase are required (for example, in acronyms such as DNA or ATP).
- Wherever appropriate, the name should use American spelling conventions.
- Avoid the use of molecular weights in protein names; "unicornase subunit A" is preferred to "unicornase 52 kDa subunit"
- Avoid the term "homolog" in a protein as this infers an evolutionary relationship that has generally not been determined.
- Avoid the use of commas in protein names whenever possible.
- Avoid the use of Roman numerals where possible. Use Arabic numbers instead.
- Do not build molecular weights into abbreviations
- Do not use diacritics, such as accents, umlauts. Many computer systems (ours included) can only understand ASCII characters.
- Do not use plurals in a protein name. e.g. "ankyrin repeats-containing protein 8" is wrong.
Here are some examples of good protein names:
- cytochrome b
- CytB
- aconitate hydrase B
- hypothetical protein
- cytochrome b-like protein
- 4Fe-4S cluster binding protein
- adenylyltransferase/ADP-heptose synthase
- 2-hydroxyhepta-2,4-diene-1,7-dioate isomerase
- short-chain specific acyl-CoA dehydrogenase
- formylmethanofuran--tetrahydromethanopterin formyltransferase
- serine/threonine-protein kinase
- translation initiation factor 1
- triphosphoribosyl-dephospho-CoA synthetase
- thiamine biosynthesis protein ThiC
- PAS domain-containing protein 5
- ABC transporter ATP-binding protein AlbC
- stage 0 sporulation protein J
- These names all concisely describe the function of the protein, where known, and avoid references to structure, homology and species.
Here are some examples of bad protein names:
- required for the efficient incorporation of molybdate into molybdoproteins
- This describes the protein's role in a biosynthetic process but is not a protein name.
- chaperone Hsp70; DNA biosynthesis; autoregulated heat shock proteins
- The name "chaperone Hsp70" is fine however the remaining comments would be best fielded as a note or in the function qualifier.
- putative carbonic anhdrase (EC 4.2.1.1)
- The EC number should not be part of the protein name but instead fielded in the EC_number qualifier
- similar to aconitrate hydrase B
- This statement is fine as a note, however as a protein name aconitrate hydrase B-like protein is preferred
- related to protein of unknown function
- uninformative name
- cytochrome b-like
- cytochrome b-like protein is preferred
- ABC transporter related
- vague name, there are many ABC transporters and subunits, be more specific. "ABC transporter-related protein" would be acceptable but a more specific name would be better, if possible.
- pirin, N-terminal:pirin, C-terminal
- uniformative name noting similarity in N and C terminus
- helix-turn-helix motif
- Describes a motif or structural domain but is not an appropriate protein name.
- PP-loop
- Describes a motif or structural domain but is not an appropriate protein name.
- alpha/beta hydrolast fold
- Describes a motif or structural domain but is not an appropriate protein name.
- pentapeptide repeat
- Describes a motif or structural domain but is not an appropriate protein name.
- phosphopantetheine-binding domain
- Describes a motif or structural domain but is not an appropriate protein name.
- protein of unknown function:conserved
- uninformative name
- hypothetical 32.5 kDa protein homologous to phytoene and squalene synthethases
- Hypothetical protein alone is appropriate. The remaining comments should be fielded as a note.
- ribosomal protein L3 (E. coli)
- Protein names should not contain references to organism names. Ribosomal protein L3 is an appropriate name by itself.
- saccharopine dehydrogenase or related protein
- "saccharopine dehydrogenase" or "saccharopine dehydrogenase-like protein" would be more appropriate
- tyrosine-protein kinase (capsular polysaccharide biosynthesis)
- tyrosine-protein kinase is fine as a protein name but capsular polysaccharide biosynthesis would be more appropriate as a function.
- RimM protein, required for 16S rRNA processing
- RimM is fine as a protein name but descriptive comments should be placed in the note.
- involved in flagellar biosynthesis
- This is a functional comment and not a protein name.
Notes
Please avoid including notes indicating a specific percentage of similarity to other entries in the database, since the corresponding record that you have pointed to may change and make your current note inaccurate, incorrect and obsolete. Descriptions, notes describing similarity to other proteins, and functional comments must be placed in the appropriate CDS qualifiers such as note, or prot_desc, as they are descriptors of the product. E.C. numbers must be fielded in an EC_number qualifier.
start stop CDS
product DNA gyrase subunit B
EC_number 5.99.1.3
Qualifiers that can be used on the CDS feature are:
start stop CDS
product
prot_desc
function
EC_number
note
experiment
inference
go_component
go_process
go_function
db_xref
pseudo
exception
transl_except
Multiple note qualifiers can be included and will be concatenated by table2asn into a single note with semi-colons as separators.
Bifunctional proteins:
If a protein contains two separate and distinct functions or if it has more than one name, these can be annotated in several ways, as outlined below. Table view:
start stop CDS
product adenylyltransferase/ADP-heptose synthase
note bifunctional
or
start stop CDS
product bifunctional adenylyltransferase/ADP-heptose synthase cyclohydrolase
or
start stop CDS
product FolD
function adenylyltransferase
function ADP-heptose synthase cyclohydrolase
note bifunctional
Partial coding regions
To annotate a partial coding region, you should use the "<" or ">" in your feature table to designate the feature as either 5' or 3' partial. The coding region should begin at the first nucleotide present in the sequence or exon, and you will indicate where the first complete codon begins in that coding region.
Partial genes within a sequence should begin or end at consensus splice sites.
Examples:
In the first example below, the "<" designates this coding region as 5' partial and "codon_start 3" tells the software to start translation with the third nucleotide of the CDS. Note that if the codon_start is not specified, then the software assumes a codon_start of 1. The second coding region below is partial at the 3' end so ">" is used to indicate a 3' partial feature. The third example is of a 3' partial coding region on the complementary or minus strand.
<1 497 CDS
product transcription factor
note similar to Bacillus subtilis aldolase
codon_start 3
protein_id gnl|dbname|KCS_0001
transcript_id gnl|dbname|mrna.KCS_0001
600 >1575 CDS
product actin-like protein
protein_id gnl|dbname|KCS_0002
transcript_id gnl|dbname|mrna.KCS_0002
436 >1 CDS
product hypothetical protein
protein_id gnl|dbname|KCS_0056
transcript_id gnl|dbname|mrna.KCS_0056
Here are more examples of formatting partial CDS features .
mRNA features
Include an mRNA feature for each translated CDS. Several things to note are:
- Use the same product name for the mRNA and its corresponding CDS.
- If there is no UTR information, then the mRNA's location will agree with its CDS's location, but the mRNA will be partial at its 5' and 3' ends.
- Extend the gene feature to include the entire mRNA.
- If the mRNA is partial, then make the gene partial.
Examples:
The first example is a complete CDS whose 5' and 3' UTRs are known.
>Feature Cont54
10400 12512 gene
locus_tag CCC_03116
10400 10462 mRNA
10533 10577
10651 11098
11182 11642
11716 12512
product hypothetical protein
protein_id gnl|dbname|CCC_03116
transcript_id gnl|dbname|mrna.CCC_03116
10450 10462 CDS
10533 10577
10651 11098
11182 11642
11716 12233
product hypothetical protein
protein_id gnl|dbname|CCC_03116
transcript_id gnl|dbname|mrna.CCC_03116
The second example is a CDS that is partial at the 5' end and lacks any 3' UTR information.
>Feature Cont3
<1 >497 gene
locus_tag CCC_111011
<1 497 CDS
note similar to Bacillus subtilis aldolase
product aldolase-like protein
codon_start 3
protein_id gnl|dbname|CCC_111011
transcript_id gnl|dbname|mrna.CCC_111011
<1 >497 mRNA
product aldolase-like protein
protein_id gnl|dbname|CCC_111011
transcript_id gnl|dbname|mrna.CCC_111011
Gene fragments
Sometimes a genome will have adjacent or nearby genes that seem to be only part of a protein. In many cases these indicate a possible problem with the sequence and/or annotation. A related issue is the presence of internal stop codons in the conceptual translation of a CDS that looks like it should be a real CDS. These problems may be due to a variety of reasons, including mutations or sequencing artifacts. They can be annotated in a number of ways:
-
Annotate the gene with /pseudo to indicate that there is a problem with the gene. Note that this qualifier does NOT mean that the gene is a pseudogene. (see point 2, below, if it is known that the gene IS a pseudogene) If multiple gene fragments were present initially, then add a single gene feature which covers all of the potential coding regions and add the pseudo qualifier. If known, a note qualifier may be added indicating why this gene is disrupted, for example:
1 200 gene gene Abc5 locus_tag KCS_0001 gene_desc alkaline phosphatase pseudo note nonfunctional due to frameshift
-
If you are sure that the disrupted or error-filled gene is a biological pseudogene, then add the pseudogene qualifier and the appropriate pseudogene type. For example:
1 200 gene gene Abc5 locus_tag KCS_0001 gene_desc alkaline phosphatase pseudogene unprocessed
-
If the feature is just noting a similarity to genes in the database and is probably not translated, then it should be annotated as a misc_feature without a corresponding gene feature.
1 200 misc_feature note similar to Abc5
Transpliced Genes
Transpliced genes are the exception to the rule for annotating gene feature spans. Transpliced genes are similar to intron containing genes except the two pieces of the gene are found on different regions of the chromosome. These genes are transcribed as two or more separate RNA products that are transpliced into a single mRNA or tRNA. To annotate this using a table, enter the nucleotide spans so that the complementary (minus strand) spans are arranged from high to low and vice versa for the plus strand.
36700 36618 gene
86988 87064
locus_tag NEQ_t38
exception trans-splicing
36631 36618 misc_feature
note sequence cleaved during processing of trans-spliced tRNAs
36673 36635
87030 87064 tRNA
product tRNA-Glu
exception trans-splicing
note this trans-spliced tRNA consists of two halves on mixed strands; it shares a 3' half with another tRNA
Flatfile view:
gene join(complement(36618..36700),86988..87064)
/locus_tag="NEQ_t38"
/trans_splicing
misc_feature complement(36618..36631)
/locus_tag="NEQ_t38"
/note="sequence cleaved during processing of trans-spliced tRNAs"
tRNA join(complement(36635..36673),87030..87064)
/locus_tag="NEQ_t38"
/product="tRNA-Glu"
/trans_splicing
/note="this trans-spliced tRNA consists of two halves on
mixed strands; it shares a 3' half with another tRNA"
Split genes on two contigs
NEW (Sept 2012): Sometimes in incomplete genomes the ends of a gene may be on different contigs. When certain that the two pieces are part of the same gene, annotate these as separate genes with unique locus_tags, plus separate CDS/mRNAs with different protein_id's and transcript_id's. In addition, link the features together with notes that refer to the other part of the gene. However, do not create extremely short features, for example if one end is only the start methinione or only a few amino acids before the stop codon.
>Feature Cont01.00111 5000 >7500 gene locus_tag KCS_2223A 5000 5500 mRNA 6000 >7200 product enolase protein_id gnl|dbname|KCS_2223A transcript_id gnl|dbname|mrna.KCS_2223A 5488 5500 CDS 6000 >7200 product enolase protein_id gnl|dbname|KCS_2223A transcript_id gnl|dbname|mrna.KCS_2223A note 5' end; 3' end is gene KCS_2223B on contig Cont01.00224
>Feature Cont01.00224 <1 1000 gene locus_tag KCS_2223B <100 1000 mRNA product enolase protein_id gnl|dbname|KCS_2223B transcript_id gnl|dbname|mrna.KCS_2223B <100 876 CDS product enolase protein_id gnl|dbname|KCS_2223B transcript_id gnl|dbname|mrna.KCS_2223B note 3' end; 5' end is gene KCS_2223A on contig Cont01.00111
Alternatively spliced genes
In many cases a gene can be alternatively spliced, yielding alternative transcripts. These transcripts may differ in the coding region and produce different products, or they may differ in the non-translated 5' or 3' UTR and produce the same protein. To annotate alternatively spliced genes, include one mRNA and CDS for each transcript, and include only one gene over all of the features. Give the corresponding mRNA and CDS the same name, and include a note "alternatively spliced" on each. If there are multiple CDS with the same name, then add a note to each mRNA and CDS to refer to each other, eg "transcript variant A" and "encoded by transcript variant A" for one mRNA/CDS pair. If the CDS have different translations, then they should have different product names. Make sure that all the proteins have unique protein_id's.
Example 1 (different products):
>Feature Cont01.00055 10 5000 gene locus_tag CCC_04562 10 500 mRNA 722 1555 2548 3901 4400 5000 product enolase isoform A note alternatively spliced protein_id gnl|dbname|CCC_04562A transcript_id gnl|dbname|mrna.CCC_04562A 102 500 CDS 722 1555 2548 3901 4400 4566 product enolase isoform A note alternatively spliced protein_id gnl|dbname|CCC_04562A transcript_id gnl|dbname|mrna.CCC_04562A 10 500 mRNA 2548 3901 4400 5000 product enolase isoform B note alternatively spliced protein_id gnl|dbname|CCC_04562B transcript_id gnl|dbname|mrna.CCC_04562B 102 500 CDS 2548 3901 4400 4566 product enolase isoform B note alternatively spliced protein_id gnl|dbname|CCC_04562B transcript_id gnl|dbname|mrna.CCC_04562B
Example 2 (same product):
>Feature Cont01.00056 100 1000 gene locus_tag CCC_03222 100 333 mRNA 444 678 800 1000 product hypothetical protein note transcript variant A; alternatively spliced protein_id gnl|dbname|CCC_03222A transcript_id gnl|dbname|mrna.CCC_03222A 456 678 CDS 800 865 product hypothetical protein note encoded by transcript variant A; alternatively spliced protein_id gnl|dbname|CCC_03222A transcript_id gnl|dbname|mrna.CCC_03222A 100 360 mRNA 444 678 800 1000 product hypothetical protein note transcript variant B; alternatively spliced protein_id gnl|dbname|CCC_03222B transcript_id gnl|dbname|mrna.CCC_03222B 456 678 CDS 800 865 product hypothetical protein note encoded by transcript variant B; alternatively spliced protein_id gnl|dbname|CCC_03222B transcript_id gnl|dbname|mrna.CCC_03222B
Ribosomal RNA, tRNA and other RNA features
RNA features (rRNA, tRNA, ncRNA) need a corresponding gene feature with a locus_tag qualifier. If the amino acid of a tRNA is unknown, use tRNA-Xxx as the product, as in the example. Many submitters like to label the tRNAs such as tRNA-Gly1, etc. If you wish to do this please include "tRNA-Gly1" as a note and not in /gene. The use of /gene is reserved for the actual biological gene symbol such as "trnG". If a tRNA is a pseudogene, please use the /pseudo qualifier.
Annotate ncRNAs that belong to one of the INSDC nRNA_class as an ncRNA feature, with the appropriate value in the required /ncRNA_class qualifier. Regions of an RNA should be annotated as a misc_feature (eg, leader sequences), or a misc_binding feature if they bind a known molecule (eg, riboswitches). If the RFAM identifier is known, it can be included as a db_xref .
Some rRNA, tRNA and ncRNA examples:
<1 400 gene
locus_tag KCS_00011
<1 400 rRNA
product 16S ribosomal RNA
488 560 gene
locus_tag KCS_00012
488 560 tRNA
product tRNA-Lys
570 601 gene
locus_tag KCS_00020
pseudo
570 601 tRNA
product tRNA-Phe
pseudo
700 780 gene
locus_tag KCS_00013
700 780 tRNA
product tRNA-Xxx
900 923 gene
locus_tag KCS_00014
900 923 ncRNA
ncRNA_class miRNA
product mir-9c
950 1000 gene
locus_tag KCS_00015
950 1000 tmRNA
product tmRNA
Riboswitches used to be annotated using the misc_binding feature if the bound moiety was known, for example:
1 100 misc_binding
note cobalamin riboswitch
bound_moiety adenosylcobalamin
New in 2017: annotate riboswitches as regulatory features with the regulatory_class 'riboswitch':
1 100 regulatory
regulatory_class riboswitch
note cobalamin riboswitch
bound_moiety adenosylcobalamin
If the bound moiety is unknown or if the sequence is a leader sequence, annotate as a misc_feature, for example:
1 100 misc_feature
note yybP-ykoY element
misc_feature and misc_binding and regulatory features do not have an associated gene feature. If it is desired to tag these features with a locus_tag-like identifier, then include that value in the note, separated from other information by a semi-colon and space.
Evidence Qualifiers
The International Nucleotide Sequence Database Collaboration, DDBJ, EMBL and GenBank has adopted a set of new qualifiers to describe the evidence for feature annotation in GenBank records. These are:
/experimental="text" /inference="TYPE:text", where 'TYPE' is from a select list and 'text' is structured text.
These qualifiers replace /evidence=experimental and /evidence=non-experimental, respectively, which are no longer supported.
See more information about the Evidence Qualifiers .
Database cross references
A variety of database cross references can be added to a feature. These appear as /db_xref on the features. This qualifier serves as a vehicle for linking of sequence records to other external databases. See the full list of db_xref databases.
1 100 CDS
product RecA
protein_id gnl|center_name|Test_0001
db_xref InterPro:IPR000111
180 210 misc_feature
note yybP-ykoY element
db_xref RFAM:RF00080
Gene Ontology
GO (Gene Ontology) terms can be included in genomes in order to describe protein functionality. Gene Ontology (GO) terms can be indicated with the following qualifiers
1 100 CDS
product helicase
go_process chromatin assembly or disassembly|0006333||IEA
go_process antimicrobial humoral response|0019730|16163390|IMP
go_component nucleus|0005634|14668392|IDA
go_component chromatin|0000785||IEA
go_function ATP-dependent helicase activity|0008026||ISS
go_function nucleic acid binding|0003676||IEA
go_function ATP binding|0005524||IEA
The value field is separated by vertical bars '|' into a descriptive string, the GO identifier (leading zeroes are retained), and optionally a PubMed ID and one or more evidence codes. The evidence code is the fourth token, so include blank fields, as necessary (eg the last qualifier has no PubMed ID so the third field is blank).
Genome Resources
- About WGS
- WGS Browser
- Genome Submission Guide
- Genome Submission Portal
- Update Genome Records
- FAQ
- table2asn
- Submitting Multiple Haplotype Assemblies
- Create Submission Template
- Eukaryotic Annotation Guide
- Prokaryotic Annotation Guide
- Annotation Example Files
- Annotating Genomes with GFF3 or GTF files
- Validation Error Explanations for Genomes
- Discrepancy Report
- NCBI Prokaryotic Genome Annotation Pipeline
- AGP Format
- Metagenome Submission Guide
- Structured Comment
- BioProject
- BioSample