Sequence Identifiers
Many sequences have two types of identification numbers, GI and VERSION. The two identifier types differ in format and were implemented at different times.
GI numbers
A GI number (for GenInfo Identifier, sometimes written in lower case, "gi") is a simple series of digits assigned consecutively to each sequence record processed by NCBI. The GI number bears no resemblance to the Version number of the sequence record. Each time the sequence data in a record changes, the record is assigned a new GI number.
A nucleotide sequence GI number is shown in the VERSION field of the database record. A protein sequence GI number is shown in the VERSION field of a protein database record, and is cross-referenced in the CDS/db_xref field of a nucleotide database record.
Sequence Versions
A sequence Version groups all of the gi numbers for a specific sequence into an ordered series. A sequence version number consists of a base Accession number, a dot, and a version suffix that starts with 1. (This identifier is often referred to as an "accession dot version".) The base Accession number identifies the sequence record, and the version suffixes form the series of versions, starting with 1. A sequence Accession number without a version suffix always refers to the latest version of the sequence.
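The structure described above (base accession, a dot, and a numeric suffix starting at 1) can be sketched in a few lines of Python. This is only an illustration: the names `SeqVersion` and `parse_accession_version` are hypothetical and are not part of any NCBI tool.

```python
from typing import NamedTuple, Optional


class SeqVersion(NamedTuple):
    accession: str          # base accession, e.g. "NC_000001"
    version: Optional[int]  # None means "no suffix", i.e. the latest version


def parse_accession_version(identifier: str) -> SeqVersion:
    """Split an identifier such as 'NC_000001.11' into its two parts."""
    base, dot, suffix = identifier.strip().partition(".")
    if not base:
        raise ValueError(f"missing accession in {identifier!r}")
    if not dot:
        # A bare accession refers to the latest version of the sequence.
        return SeqVersion(base, None)
    # Version suffixes are plain integers that start at 1.
    if not (suffix.isascii() and suffix.isdigit()) or int(suffix) < 1:
        raise ValueError(f"invalid version suffix in {identifier!r}")
    return SeqVersion(base, int(suffix))


print(parse_accession_version("NC_000001.11"))
print(parse_accession_version("NC_000001"))
```

Returning `None` for a missing suffix keeps "latest version" distinct from any concrete version number.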
The two systems of identifiers run in parallel. That is, when any change is made to a sequence, it receives a new GI number and the version part of its accession number is incremented by 1.
For example, here is the sequence revision history of Reference Sequence Human Chromosome 1, as of October 2014:
Accession.Version | gi | Date |
---|---|---|
NC_000001.11 | 568815597 | Feb 3, 2014 11:01 PM |
NC_000001.10 | 224589800 | Aug 13, 2013 12:15 PM |
NC_000001.10 | 224589800 | Mar 5, 2013 02:59 PM |
NC_000001.10 | 224589800 | Mar 5, 2013 02:13 PM |
NC_000001.10 | 224589800 | Mar 3, 2013 10:59 PM |
NC_000001.10 | 224589800 | Oct 30, 2012 08:39 PM |
NC_000001.10 | 224589800 | Jul 24, 2012 03:18 PM |
NC_000001.10 | 224589800 | Jul 29, 2011 05:58 AM |
NC_000001.10 | 224589800 | Oct 25, 2010 05:33 PM |
NC_000001.10 | 224589800 | Jun 10, 2009 04:09 PM |
NC_000001.9 | 89161185 | Mar 3, 2008 05:58 PM |
NC_000001.9 | 89161185 | Aug 30, 2006 12:10 PM |
NC_000001.9 | 89161185 | Mar 3, 2006 05:23 PM |
NC_000001.8 | 51511461 | Oct 25, 2004 02:33 PM |
NC_000001.8 | 51511461 | Aug 24, 2004 04:34 PM |
NC_000001.8 | 51511461 | Aug 24, 2004 11:05 AM |
NC_000001.7 | 42406218 | Feb 20, 2004 09:34 AM |
NC_000001.7 | 42406218 | Feb 4, 2004 03:56 PM |
NC_000001.6 | 42405892 | Feb 4, 2004 12:17 PM |
NC_000001.5 | 37623929 | Jan 28, 2004 04:08 PM |
NC_000001.5 | 37623929 | Oct 23, 2003 11:08 AM |
NC_000001.5 | 37623929 | Oct 17, 2003 10:45 AM |
NC_000001.5 | 37623929 | Oct 16, 2003 03:44 PM |
NC_000001.5 | 37623929 | Oct 10, 2003 01:19 PM |
NC_000001.4 | 29824572 | May 6, 2003 10:42 AM |
NC_000001.4 | 29824572 | Apr 12, 2003 11:33 AM |
NC_000001.3 | 29824110 | Apr 11, 2003 11:54 PM |
NC_000001.2 | 27777714 | Feb 14, 2003 04:18 PM |
NC_000001.2 | 27777714 | Jan 17, 2003 12:40 PM |
NC_000001.1 | 22539468 | Aug 29, 2002 04:14 PM |
Note that the gi number doesn't change every time the record is modified. Only changes to the sequence data trigger assignment of a new gi; minor updates are tracked, but don't change the gi or version number. But note that every time the gi changes, the version number is incremented.
See Sequence Revision History for more details.
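The relationship shown in the revision history (each Accession.Version is paired with exactly one gi, and the reverse) can be checked mechanically. The sketch below uses a handful of rows copied from the table; the function name `check_parallel` is hypothetical.

```python
# A few (Accession.Version, gi) rows copied from the revision history table.
HISTORY = [
    ("NC_000001.11", 568815597),
    ("NC_000001.10", 224589800),
    ("NC_000001.10", 224589800),  # minor update: same version, same gi
    ("NC_000001.9", 89161185),
    ("NC_000001.9", 89161185),
    ("NC_000001.8", 51511461),
]


def check_parallel(rows):
    """Return {accession.version: gi}, raising if the pairing is not one-to-one."""
    by_version = {}
    by_gi = {}
    for acc_ver, gi in rows:
        if by_version.setdefault(acc_ver, gi) != gi:
            raise ValueError(f"{acc_ver} appears with more than one gi")
        if by_gi.setdefault(gi, acc_ver) != acc_ver:
            raise ValueError(f"gi {gi} appears with more than one version")
    return by_version


pairs = check_parallel(HISTORY)
for acc_ver, gi in pairs.items():
    print(acc_ver, gi)
```

Repeated rows with the same pair are accepted, which mirrors the minor updates that leave both identifiers unchanged.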
Historical Note
The GI number has been used for many years by NCBI to track sequence histories in GenBank and the other NCBI sequence databases. The Accession.Version system of identifiers was adopted in February 1999 by the International Nucleotide Sequence Database Collaboration (GenBank, EMBL, and DDBJ).
The first type of sequence identification number was GI, which stands for "GenInfo Identifier." GenInfo was an early system used to access GenBank and related databases. A GI number was assigned to each nucleotide and protein sequence accessible through the NCBI search systems, and was a means of tracking changes to the sequence. However, GI numbers were not used uniformly across the collaborating databases (GenBank, EMBL, DDBJ). They instead served as an internal tracking system for the databases that chose to implement them. In addition, the gi number for a nucleotide sequence originally appeared in the COMMENT field of a record. There was no separate field for sequence identification numbers.
When the collaborating databases began to formalize use of sequence identifiers, they created a new, separate field called NID (nucleotide identifier) in the database record, which contained the GI number of the nucleotide sequence. Similarly, the GI number for each protein sequence was named PID, and placed above each amino acid translation in the field: FEATURES/CDS/db_xref="PID:gNNNNNN". Hence, there were two types of gi numbers: NID and PID. In December 1999, the use of the abbreviations "NID" and "PID" was discontinued. Both are now just shown as "GI".
In February 1999, GenBank/EMBL/DDBJ implemented a new "accession.version" system of sequence identifiers that runs parallel to the gi number system.
Unlike the gi number system, in which sequence identification numbers were not necessarily consistent across the databases (e.g., GenBank and EMBL could each assign their own gi number to a sequence), the new system is designed to ensure consistency. It is also designed to show a relationship between a sequence identification number and the accession number of the record in which it is found. In contrast, GI numbers are assigned consecutively and bear no resemblance to the accession number. Finally, the new system allows the assignment of alphanumeric protein IDs to protein translations within nucleotide sequence records. The protein IDs contain three letters followed by five digits, a period, and a version number.
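The protein ID format just described (three letters, five digits, a period, and a version number) maps directly onto a regular expression. The following is a minimal sketch, assuming the letters are upper case and the version starts at 1; the function name and the sample IDs are made up for illustration.

```python
import re

# Three letters, five digits, a period, then a version number (1 or higher).
# Upper-case letters are an assumption; the text only says "three letters".
PROTEIN_ID = re.compile(r"[A-Z]{3}[0-9]{5}\.[1-9][0-9]*")


def is_protein_id(text: str) -> bool:
    """Return True if text has the accession.version protein ID shape."""
    return PROTEIN_ID.fullmatch(text) is not None


# Illustrative values only, not real records.
for candidate in ("AAA12345.1", "AAA12345", "AA123456.1"):
    print(candidate, is_protein_id(candidate))
```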
Since December 1999 (GenBank release 115.0):
- the NID field and /db_xref="PID:xxxxxxx" qualifier were removed, and both are now simply shown as "GI" numbers
- the VERSION field of nucleotide records contains both an accession.version and a GI number for the nucleotide sequence
- each amino acid translation is labeled with an accession.version sequence identifier (in the /protein_id field) and a GI number (in the /db_xref=GI:xxxxxxx qualifier), under the CDS feature of a GenBank record
- the accession.version and GI systems of sequence identifiers run in parallel to each other. Therefore, when any change is made to a sequence, it receives a new GI number, and its version suffix is incremented by 1
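The layout in the list above can be illustrated with a small parser. This is a sketch under assumptions: it expects a VERSION line of the form `VERSION     <accession.version>  GI:<digits>` and CDS qualifiers written as `/protein_id=...` and `/db_xref=GI:...`, with or without surrounding quotes. The function names are hypothetical, and the CDS values are invented for the example; only the VERSION values come from the revision history table.

```python
import re

VERSION_LINE = re.compile(r"^VERSION\s+(?P<accver>\S+\.[0-9]+)\s+GI:(?P<gi>[0-9]+)\s*$")
PROTEIN_ID = re.compile(r'/protein_id="?([^"\s]+)"?')
DB_XREF_GI = re.compile(r'/db_xref="?GI:([0-9]+)"?')


def parse_version_line(line: str):
    """Return (accession.version, gi) from a VERSION line."""
    match = VERSION_LINE.match(line)
    if match is None:
        raise ValueError(f"not a VERSION line with a GI: {line!r}")
    return match.group("accver"), int(match.group("gi"))


def parse_cds_ids(feature_text: str):
    """Return (protein_id, gi) from the qualifiers of one CDS feature."""
    pid = PROTEIN_ID.search(feature_text)
    gi = DB_XREF_GI.search(feature_text)
    if pid is None or gi is None:
        raise ValueError("CDS feature lacks /protein_id or /db_xref GI")
    return pid.group(1), int(gi.group(1))


print(parse_version_line("VERSION     NC_000001.11  GI:568815597"))

# Hypothetical CDS qualifiers, for illustration only.
cds = '/protein_id="AAA12345.1"\n/db_xref="GI:1234567"'
print(parse_cds_ids(cds))
```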
For more information, see section 3.4.7 ("VERSION") of the current GenBank release notes.