
How do I associate my virus sequences in GenBank with my data in the Sequence Read Archive (SRA)?

If you are submitting virus data to both GenBank and the Sequence Read Archive (SRA), you can link your GenBank records and related SRA data under the same BioProject (see a BioProject example).
This article describes what to do (A) if you are preparing new GenBank and SRA submissions or (B) if you want to retroactively link existing GenBank records with your other data. It also covers (C) registering a BioProject for GenBank records only.

 

A. New SRA/GenBank submissions

To establish a GenBank-SRA connection for new submissions, follow these steps:
Step 1
  • Start by submitting your reads to SRA first (see guidelines).
  • During SRA submission, register BioProject/BioSamples (see introductory video).
  • Once your SRA submission is processed, you will have accession numbers of the following formats:
    • BioProject: PRJNA#####1
    • BioSample: SAMN#####1, SAMN#####2, …
    • SRA runs: SRR#####1, SRR#####2, SRR#####3, SRR#####4, …
Step 2
  • Select the GenBank submission path according to the virus organism(s) that you sequenced (you can submit complete or partial genomes or individual-gene sequences):
    • For SARS-CoV-2, Influenza A, B, and C, Dengue, or Norovirus, use the appropriate submission wizard in the GenBank Submission Portal.
    • For all other viruses, use BankIt.
  • Follow submission guidelines as provided within each tool.
  • During submission, provide the BioProject accession and (if applicable) BioSample and SRA accessions from Step 1.  

You can provide the BioProject accession in the FASTA definition line. Enclose the accession in square brackets, as shown in this generic format (see the example in Figure 1):


>Seq1 [BioProject=PRJNA#####1]


In a similar manner you can also add BioSample and SRA accessions (see the example in Figure 2):


>Seq1 [BioProject=PRJNA#####1] [BioSample=SAMN#####1] [SRA=SRR#####1, SRR#####2]
>Seq2 [BioProject=PRJNA#####1] [BioSample=SAMN#####2] [SRA=SRR#####3, SRR#####4]


This method works for both BankIt and the GenBank Submission Portal. In addition, the portal can process BioProject/BioSample/SRA accessions if you add them as columns in your Source Modifier table. Note that the "table" method does not work in BankIt because it runs on older software.
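When a submission contains many sequences, the definition lines can be generated rather than typed by hand. The following is a minimal sketch, not an NCBI tool: the function name `definition_line` and its parameters are hypothetical, and the accessions are the same placeholders used in the generic format above. The check that a sequence ID contains no whitespace reflects the general FASTA convention that the ID is the first token on the line.

```python
def definition_line(seq_id, bioproject, biosample=None, sra_runs=()):
    """Build one FASTA definition line with bracketed accession modifiers.

    Hypothetical helper for illustration; it only formats text and does
    not verify that the accessions exist.
    """
    if not seq_id or any(ch.isspace() for ch in seq_id):
        raise ValueError("sequence ID must be non-empty and contain no whitespace")

    parts = [f">{seq_id}", f"[BioProject={bioproject}]"]
    if biosample:
        parts.append(f"[BioSample={biosample}]")
    if sra_runs:
        # Multiple SRA runs go inside one pair of brackets, comma-separated.
        parts.append(f"[SRA={', '.join(sra_runs)}]")

    # Everything is joined on a single line: no hard returns.
    return " ".join(parts)


print(definition_line("Seq1", "PRJNA#####1"))
print(definition_line("Seq1", "PRJNA#####1", "SAMN#####1",
                      ["SRR#####1", "SRR#####2"]))
```

The BioSample and SRA parts are optional, so the same helper covers both the BioProject-only format and the fuller format shown above.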

 



Figure 1: An example of two FASTA definition lines with a formatted BioProject accession number for submitting through BankIt or the GenBank Submission Portal. The definition lines also contain unique sequence_IDs (required). In this example, the submitter provides sequence source information (such as isolate, host, collection_date, …) as a table at the Source Modifiers step of the submission process.

 


Figure 2: An example of two FASTA definition lines with formatted BioProject/BioSample/SRA accession numbers for submitting through BankIt* or GenBank Submission Portal.  Each FASTA definition line contains a unique sequence ID (required). The submitter can expand the definition line by adding source modifiers, such as “isolate”, “host”, and “collection_date”, all formatted within square brackets. Adding source information to the FASTA definition line is an alternative that you can use instead of preparing a Source Modifiers table.
A text editor may wrap a long definition line so that it displays on two lines, but the definition line itself must not contain hard returns.
*If submitting through BankIt, submitters can also add the organism name (for example, [organism=Human astrovirus]) to the definition line.
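The two requirements in this caption (unique sequence IDs and no hard returns inside a definition line) can be checked before submission. The sketch below is an illustration under stated assumptions, not part of any NCBI tool: the function name `check_definition_lines` is hypothetical, and it treats a non-definition line that begins with `[` as a likely sign that a definition line was broken by a hard return.

```python
import re

# Matches one bracketed modifier such as [BioProject=PRJNA#####1].
MODIFIER = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")


def check_definition_lines(fasta_text):
    """Return a list of problems found in the definition lines of FASTA text."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(fasta_text.splitlines(), start=1):
        if not line.startswith(">"):
            # Sequence lines are skipped, but a stray modifier suggests that
            # a definition line was split by a hard return.
            if line.lstrip().startswith("["):
                problems.append(f"line {number}: modifier outside a definition line")
            continue

        tokens = line[1:].split(maxsplit=1)
        if not tokens or tokens[0].startswith("["):
            problems.append(f"line {number}: missing sequence ID")
            continue

        seq_id = tokens[0]
        if seq_id in seen_ids:
            problems.append(f"line {number}: duplicate sequence ID {seq_id}")
        seen_ids.add(seq_id)

        # After removing well-formed modifiers, no bracket should remain.
        rest = tokens[1] if len(tokens) > 1 else ""
        leftover = MODIFIER.sub("", rest)
        if "[" in leftover or "]" in leftover:
            problems.append(f"line {number}: malformed bracketed modifier")
    return problems
```

An empty list means that none of these particular problems were found; it does not mean the submission tools will accept the file.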

 

 

B. Retroactive linking of GenBank records

To retroactively link your accessioned GenBank records with your associated SRA data, send a mapping table (see Figure 3 for an example) to GenBank staff (at gb-admin@ncbi.nlm.nih.gov).  The table can be a tab-delimited, plain-text file or an Excel spreadsheet. It should contain:
    • GenBank accessions in the first column.
    • Additional columns, one for each type of associated accession.

 


 

Figure 3: An example of an accession-mapping table to retroactively link GenBank records with BioProject/BioSample/SRA data.
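A tab-delimited mapping table of this kind can be produced with a few lines of code. The following is a minimal sketch under stated assumptions: the function name `mapping_table` is hypothetical, the column headers (`GenBank`, `BioProject`, `BioSample`, `SRA`) are illustrative and may differ from those in Figure 3, and `XX#####1` is a placeholder for a GenBank accession, not a real one.

```python
import csv
import io


def mapping_table(rows, extra_columns=("BioProject", "BioSample", "SRA")):
    """Return tab-delimited text with GenBank accessions in the first column.

    Column headers are assumptions for illustration only.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["GenBank", *extra_columns])
    for row in rows:
        # Missing accession types are left as empty cells.
        writer.writerow([row["GenBank"],
                         *(row.get(column, "") for column in extra_columns)])
    return buffer.getvalue()


rows = [
    {"GenBank": "XX#####1", "BioProject": "PRJNA#####1",
     "BioSample": "SAMN#####1", "SRA": "SRR#####1, SRR#####2"},
    {"GenBank": "XX#####2", "BioProject": "PRJNA#####1",
     "BioSample": "SAMN#####2", "SRA": "SRR#####3, SRR#####4"},
]
print(mapping_table(rows), end="")
```

Writing the returned text to a file with a `.txt` or `.tsv` extension gives a plain-text table that can be attached to the email.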

 

 

C. BioProject for GenBank records only

You can register a BioProject for your virus sequences even if you have no SRA data. For example, you can register a BioProject if you are submitting large volumes of sequences to GenBank over time. In that case, register the BioProject directly through the BioProject Submission Portal. Then provide the BioProject accession number during your GenBank submissions, as described in Step 2 of section A.