
How can I download a list of IDs for all sequences from a specific organism or taxonomic group at NCBI?

Use one of these three approaches:  

(1) Directly from the web; suitable only for organisms or taxonomic groups that have a relatively small number of sequence records in the Nucleotide or Protein database:
  • Access the sequence database that you want on the web, for example Nucleotide.
  • Search for your organism by entering its name restricted to the organism field, for example:
Salarchaeum japonicum[organism]
  • Use the Send to link (located top right above the results on the search results page) and select File.
  • Select either Accession List or GI List as your Format and use the Create File button to download the list.
 
(2) E-utilities; use the NCBI E-utilities API for organisms or taxonomic groups that have a large number of sequence records in the Nucleotide or Protein database:
  • Use esearch to search, for example, for all Archaea sequence records in the Nucleotide database. Search URL example:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&term=Archaea[organism]&usehistory=y

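Setting usehistory=y stores the search results on the Entrez history server. The esearch output then includes a Web environment string (WebEnv) and a query key (QueryKey), which you pass to later calls as the &WebEnv and &query_key parameters to specify the location of the retrieved GIs.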
  • Follow with efetch. Your URL should include the query key number and the web environment (WebEnv) string generated by esearch.  Specify the rettype as uilist and retmode as text. Example:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&query_key=<key>&WebEnv=<webenv string>&rettype=uilist&retmode=text
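
The two E-utilities steps above can be chained in a small shell script. The following is a minimal sketch, not an official NCBI tool: the function names (xml_value, esearch_url, efetch_url, fetch_id_list) are hypothetical, it assumes curl and sed are available, and it assumes the esearch XML contains QueryKey and WebEnv elements as described above.

```shell
#!/bin/sh
# Base URL for the E-utilities (assumed to be served over HTTPS).
EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Print the text of the first <TAG>...</TAG> element read from stdin.
xml_value() {  # usage: ... | xml_value TAG
    sed -n "s:.*<$1>\([^<]*\)</$1>.*:\1:p" | head -n 1
}

# Build the esearch URL; spaces and square brackets in the term are encoded.
esearch_url() {  # usage: esearch_url DB TERM
    term=$(printf '%s' "$2" | sed -e 's/ /+/g' -e 's/\[/%5B/g' -e 's/]/%5D/g')
    printf '%s/esearch.fcgi?db=%s&term=%s&usehistory=y\n' "$EUTILS" "$1" "$term"
}

# Build the efetch URL from the query key and WebEnv string returned by esearch.
efetch_url() {  # usage: efetch_url DB QUERY_KEY WEBENV
    printf '%s/efetch.fcgi?db=%s&query_key=%s&WebEnv=%s&rettype=uilist&retmode=text\n' \
        "$EUTILS" "$1" "$2" "$3"
}

# Hypothetical driver: run the search, then print the ID list to stdout.
fetch_id_list() {  # usage: fetch_id_list DB TERM > ids.txt
    xml=$(curl -s "$(esearch_url "$1" "$2")")
    key=$(printf '%s' "$xml" | xml_value QueryKey)
    webenv=$(printf '%s' "$xml" | xml_value WebEnv)
    curl -s "$(efetch_url "$1" "$key" "$webenv")"
}
```

With these functions loaded, a call such as fetch_id_list nucleotide "Archaea[organism]" > archaea.gis would write the list to a file. For very large result sets, a single efetch call may not return every ID; in that case the request would likely need to be repeated in batches using the retstart and retmax parameters.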
 
(3) EDirect; use Entrez Direct (EDirect) as the UNIX command line alternative to E-utilities:
 
EDirect is a relatively new method for searching and retrieving records in NCBI databases from the command line, so you need access to a UNIX or Linux terminal. EDirect runs on UNIX and Macintosh computers that have the Perl language installed, and under the Cygwin UNIX-emulation environment on Windows PCs.

Here are command line examples that would generate the GI list or the accession list for all Archaea records in the Protein database:
esearch -db protein -query "Archaea[organism]" | efetch -db protein -format uid > archaea.gis
esearch -db protein -query "Archaea[organism]" | efetch -db protein -format acc > archaea.acc
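
For large taxonomic groups it can be worth confirming that the downloaded list is complete. The following is a rough sketch with hypothetical helper names (count_ids, expected_count): it compares the number of lines in the downloaded file with the record count that esearch reports, and it assumes EDirect's xtract command is installed alongside esearch.

```shell
#!/bin/sh
# Count the non-empty lines (one ID per line) in a downloaded list.
count_ids() {  # usage: count_ids FILE
    grep -c . "$1"
}

# Ask Entrez how many records match the query, without fetching them.
# Requires EDirect (esearch and xtract) on the PATH.
expected_count() {  # usage: expected_count DB QUERY
    esearch -db "$1" -query "$2" | xtract -pattern ENTREZ_DIRECT -element Count
}

# Example check (not run here, since it contacts NCBI):
#   [ "$(count_ids archaea.acc)" -eq "$(expected_count protein 'Archaea[organism]')" ] \
#       && echo "list is complete"
```

The two numbers can differ slightly if records are added to the database between the download and the check.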