Review
FEBS J. 2005 Oct;272(20):5101-9.
doi: 10.1111/j.1742-4658.2005.04945.x.

Protein database searches using compositionally adjusted substitution matrices


Stephen F Altschul et al. FEBS J. 2005 Oct.

Abstract

Almost all protein database search methods use amino acid substitution matrices for scoring, optimizing, and assessing the statistical significance of sequence alignments. Much care and effort have therefore gone into constructing substitution matrices, and the quality of search results can depend strongly upon the choice of the proper matrix. A long-standing problem has been the comparison of sequences with biased amino acid compositions, for which standard substitution matrices are not optimal. To address this problem, we have recently developed a general procedure for transforming a standard matrix into one appropriate for the comparison of two sequences with arbitrary, and possibly differing, compositions. Such adjusted matrices yield, on average, improved alignments and alignment scores when applied to the comparison of proteins with markedly biased compositions. Here we review the application of compositionally adjusted matrices and consider whether they may also be applied fruitfully to general-purpose protein sequence database searches, in which related sequence pairs do not necessarily have strong compositional biases. Although it is not advisable to apply compositional adjustment indiscriminately, we describe several simple criteria under which invoking such adjustment is on average beneficial. In a typical database search, at least one of these criteria is satisfied by over half the related sequence pairs. Compositional substitution matrix adjustment is now available in NCBI's protein-protein version of BLAST.


Figures

Figure 1. ROCn curves for the aravind103 and astral40 data sets using standard BLOSUM-62 and conditionally compositionally adjusted BLOSUM-62
The BLAST program [25, 26, 29] was used to compare the test query sets to the test databases, with database sequences filtered of low-complexity segments using the SEG program [36] with parameters (10, 1.8, 2.1). Search results were pooled and ranked by E-value, and ROCn curves [29, 34] were obtained by plotting true positives versus false positives for increasing E-values. For each test set, local alignment scores [9] were calculated using BLOSUM-62 substitution scores [13] and affine gap costs [40, 41]. Composition-based statistics [29] were employed in order to obtain accurate E-values. Specifically, for sufficiently high-scoring alignments, the BLOSUM-62 substitution scores were scaled to have an ungapped λ [10] of 0.006352 in the context of the two sequences being compared, and were used in conjunction with scores of -550 - 50k for a gap of length k. Gapped statistical parameters have been estimated for this scoring system using random simulation [42] and scaling arguments [26, 29]. Also, for each test set, a second run was performed with conditionally compositionally adjusted BLOSUM-62 substitution scores, constrained to have a relative entropy of 0.44 nats in the context of the two sequences being compared (mode C). (a) The aravind103 test set was compared to a yeast protein sequence database that had been edited to remove extra copies of highly similar sequences [29]. (b) A subset of 3586 sequences from the astral40 data set [30, 31] was used as queries against astral40; all self-comparisons were excluded.
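The pooling-and-ranking step in the caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the `(e_value, is_true_positive)` hit representation, and the normalized ROCn score (the mean true-positive count at each of the first n false positives, divided by the total number of true positives) are assumptions based on the commonly used definition. Padding when fewer than n false positives occur is also an assumed convention. The `gap_score` helper simply encodes the scaled gap score of -550 - 50k quoted in the caption.

```python
from typing import List, Sequence, Tuple

Hit = Tuple[float, bool]  # (E-value, True if the pair is a known true positive)


def gap_score(k: int) -> int:
    """Scaled affine gap score from the caption: -550 - 50k for a gap of length k."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return -550 - 50 * k


def roc_n_curve(hits: Sequence[Hit], n: int) -> List[Tuple[int, int]]:
    """Return (false positives, true positives) points for the first n false positives.

    Hits from all queries are pooled and ranked by increasing E-value.
    """
    ranked = sorted(hits, key=lambda h: h[0])
    curve: List[Tuple[int, int]] = []
    tp = 0
    fp = 0
    for _, is_true in ranked:
        if is_true:
            tp += 1
        else:
            fp += 1
            curve.append((fp, tp))
            if fp == n:
                break
    return curve


def roc_n_score(hits: Sequence[Hit], n: int, total_true: int) -> float:
    """Normalized ROCn score in [0, 1] (assumed definition, see lead-in).

    total_true is the number of true positives that exist in the benchmark,
    whether or not the search reported them.
    """
    if n < 1 or total_true < 1:
        raise ValueError("n and total_true must be positive")
    tp_counts = [tp for _, tp in roc_n_curve(hits, n)]
    if len(tp_counts) < n:
        # Assumed convention: if the search reports fewer than n false
        # positives, pad with the number of true positives actually found.
        found = sum(1 for _, is_true in hits if is_true)
        tp_counts.extend([found] * (n - len(tp_counts)))
    return sum(tp_counts) / (n * total_true)


if __name__ == "__main__":
    # Hypothetical pooled results, for illustration only.
    pooled = [(1e-30, True), (1e-20, True), (1e-5, False),
              (1e-3, True), (0.1, False), (1.0, True)]
    print(roc_n_curve(pooled, 2))
    print(roc_n_score(pooled, 2, total_true=4))
    print(gap_score(1))
```

Plotting the second element of each curve point against the first gives a curve of the kind shown in the figure. A higher curve, or a larger score, indicates that more true positives are ranked ahead of the first n false positives.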

Comment in

  • Identifying protein interactions.
    Appella E, Anderson CW. FEBS J. 2005 Oct;272(20):5099-100. doi: 10.1111/j.1742-4658.2005.04944.x. PMID: 16218943. No abstract available.

References

    1. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48:443–53.
    2. McLachlan AD. Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c 551. J Mol Biol. 1971;61:409–24.
    3. Dayhoff MO, Schwartz RM, Orcutt BC. A model of evolutionary change in proteins. In: Dayhoff MO, editor. Atlas of Protein Sequence and Structure. Washington, DC: Natl Biomed Res Found; 1978. p. 345–52.
    4. Schwartz RM, Dayhoff MO. Matrices for detecting distant relationships. In: Dayhoff MO, editor. Atlas of Protein Sequence and Structure. Washington, DC: Natl Biomed Res Found; 1978. p. 353–58.
    5. Feng DF, Johnson MS, Doolittle RF. Aligning amino acid sequences: comparison of commonly used methods. J Mol Evol. 1984;21:112–25.