If you are submitting virus data to both GenBank and the Sequence Read Archive (SRA), you can link your GenBank records and related SRA data under the same BioProject (see a BioProject example).
This article outlines the approach to take (A) if you are preparing new GenBank and SRA submissions or (B) if you want to retroactively link your GenBank records with your other data. The article also addresses (C) situations where you want a BioProject for your GenBank records only.
A. New SRA/GenBank submissions
To establish a GenBank-SRA connection for new submissions, follow these steps:
You can provide BioProject accessions through the FASTA definition line. In the definition line, you must include the BioProject accession number within square brackets, as shown in this generic format (see the example in Figure 1):
>Seq1 [BioProject=PRJNA#####1]
In a similar manner you can also add BioSample and SRA accessions (see the example in Figure 2):
>Seq1 [BioProject=PRJNA#####1] [BioSample=SAMN#####1] [SRA=SRR#####1, SRR#####2]
>Seq2 [BioProject=PRJNA#####1] [BioSample=SAMN#####2] [SRA=SRR#####3, SRR#####4]
This method works for both BankIt and the GenBank Submission Portal. Additionally, the portal software can process BioProject/BioSample/SRA accessions if you add them as columns in your Source Modifiers table. Note that the “table” method does not work for BankIt because of its older software.
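As an illustration of the bracketed format above, the following minimal Python sketch assembles definition lines from accession values. The function name build_defline and its parameters are hypothetical helpers introduced here for illustration and are not part of any NCBI tool; the accessions are the same placeholders used in the generic format.

```python
def build_defline(seq_id, bioproject, biosample=None, sra_runs=None):
    """Return one FASTA definition line with bracketed accessions.

    Hypothetical helper: it only joins strings in the generic format
    shown in this article and performs no validation of the accessions.
    """
    parts = [f">{seq_id}", f"[BioProject={bioproject}]"]
    if biosample:
        parts.append(f"[BioSample={biosample}]")
    if sra_runs:
        # Multiple SRA runs share one bracket, separated by ", ".
        parts.append(f"[SRA={', '.join(sra_runs)}]")
    # Join with single spaces so the definition line stays on one line.
    return " ".join(parts)


if __name__ == "__main__":
    print(build_defline("Seq1", "PRJNA#####1"))
    print(build_defline("Seq1", "PRJNA#####1", "SAMN#####1",
                        ["SRR#####1", "SRR#####2"]))
```

Replace the placeholder values with your own sequence IDs and accessions before pasting the resulting lines into your FASTA file.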
Figure 1: An example of two FASTA definition lines with a formatted BioProject accession number for submitting through BankIt or the GenBank Submission Portal. The definition lines also contain unique sequence_IDs (required). In this example, the submitter provides sequence source information (such as isolate, host, collection_date, …) as a table at the Source Modifiers step of their submission process.
Figure 2: An example of two FASTA definition lines with formatted BioProject/BioSample/SRA accession numbers for submitting through BankIt* or GenBank Submission Portal. Each FASTA definition line contains a unique sequence ID (required). The submitter can expand the definition line by adding source modifiers, such as “isolate”, “host”, and “collection_date”, all formatted within square brackets. Adding source information to the FASTA definition line is an alternative that you can use instead of preparing a Source Modifiers table.
While text wrapping in a text editor breaks the line into two, there should be no hard returns within a definition line.
*If submitting through BankIt, submitters can also add the organism name (for example, [organism=Human astrovirus]) to the definition line.
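Before submitting, it can help to confirm that each definition line sits on a single line and carries a unique sequence ID. The sketch below is one possible local check, not an NCBI validator; the function name check_deflines and the bracket pattern are assumptions made for illustration, and the check covers only the two requirements named above.

```python
import re

# Assumed pattern for one bracketed modifier such as [BioProject=PRJNA#####1].
MODIFIER = re.compile(r"\[([^\[\]=]+)=([^\[\]]+)\]")


def check_deflines(fasta_text):
    """Return a list of problems found in the definition lines (hypothetical helper)."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(fasta_text.splitlines(), start=1):
        if line.startswith(">"):
            tokens = line[1:].split(maxsplit=1)
            if not tokens:
                problems.append(f"line {number}: missing sequence ID")
                continue
            seq_id = tokens[0]
            if seq_id in seen_ids:
                problems.append(f"line {number}: duplicate sequence ID {seq_id}")
            seen_ids.add(seq_id)
            rest = tokens[1] if len(tokens) > 1 else ""
            # Remove well-formed modifiers; any bracket left over is unbalanced.
            leftover = MODIFIER.sub("", rest)
            if "[" in leftover or "]" in leftover:
                problems.append(f"line {number}: unbalanced square brackets")
        elif "[" in line or "]" in line:
            # A bracket on a sequence line usually means a definition line
            # was split by a hard return.
            problems.append(f"line {number}: bracketed text outside a definition line")
    return problems


if __name__ == "__main__":
    sample = ">Seq1 [BioProject=PRJNA#####1]\n[BioSample=SAMN#####1]\nACGT\n"
    for problem in check_deflines(sample):
        print(problem)
```

An empty list means no problem of these kinds was detected; it does not guarantee that the submission will be accepted.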
B. Retroactive linking of GenBank records
To retroactively link your accessioned GenBank records with your associated SRA data, send a mapping table (see Figure 3 for an example) to GenBank staff (at gb-admin@ncbi.nlm.nih.gov). The table can be a tab-delimited, plain-text file or an Excel spreadsheet. It should contain:
Figure 3: An example of an accession-mapping table to retroactively link GenBank records with BioProject/BioSample/SRA data.
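A tab-delimited, plain-text mapping table can be produced with a short script. The following Python sketch is illustrative only: the column headers, the function name write_mapping_table, and the placeholder accessions are all assumptions and do not reproduce the layout in Figure 3, so adjust them to match the columns that GenBank staff request.

```python
import csv
import io

# Hypothetical column headers; replace them with the ones GenBank staff ask for.
COLUMNS = ["GenBank_accession", "BioProject", "BioSample", "SRA"]


def write_mapping_table(rows, handle):
    """Write rows (dicts keyed by COLUMNS) as tab-delimited text to handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in rows:
        writer.writerow([row[column] for column in COLUMNS])


if __name__ == "__main__":
    # Placeholder values only; none of these are real accessions.
    rows = [
        {"GenBank_accession": "GENBANK_ACC_1", "BioProject": "PRJNA#####1",
         "BioSample": "SAMN#####1", "SRA": "SRR#####1, SRR#####2"},
    ]
    buffer = io.StringIO()
    write_mapping_table(rows, buffer)
    print(buffer.getvalue(), end="")
```

To save the table to a file instead of printing it, pass a handle opened with open("mapping.tsv", "w", newline="") in place of the in-memory buffer.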
C. BioProject for GenBank records only
Not having SRA data does not preclude you from registering a BioProject for your virus sequences. For example, you can register a BioProject if you are submitting large volumes of sequences to GenBank over time. In such a case, register the BioProject directly through the BioProject Submission Portal. Subsequently, provide the BioProject accession number during your GenBank submissions as described in step 2 of section A of this article.