Review

FEBS J. 2005 Oct;272(20):5101-9.

doi: 10.1111/j.1742-4658.2005.04945.x.

Protein database searches using compositionally adjusted substitution matrices

Stephen F Altschul¹, John C Wootton, E Michael Gertz, Richa Agarwala, Aleksandr Morgulis, Alejandro A Schäffer, Yi-Kuo Yu

Affiliation

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA. altschul@ncbi.nlm.nih.gov

PMID: 16218944
PMCID: PMC1343503
DOI: 10.1111/j.1742-4658.2005.04945.x

Stephen F Altschul et al. FEBS J. 2005 Oct.

Abstract

Almost all protein database search methods use amino acid substitution matrices for scoring, optimizing, and assessing the statistical significance of sequence alignments. Much care and effort has therefore gone into constructing substitution matrices, and the quality of search results can depend strongly upon the choice of the proper matrix. A long-standing problem has been the comparison of sequences with biased amino acid compositions, for which standard substitution matrices are not optimal. To address this problem, we have recently developed a general procedure for transforming a standard matrix into one appropriate for the comparison of two sequences with arbitrary, and possibly differing, compositions. Such adjusted matrices yield, on average, improved alignments and alignment scores when applied to the comparison of proteins with markedly biased compositions. Here we review the application of compositionally adjusted matrices and consider whether they may also be applied fruitfully to general purpose protein sequence database searches, in which related sequence pairs do not necessarily have strong compositional biases. Although it is not advisable to apply compositional adjustment indiscriminately, we describe several simple criteria under which invoking such adjustment is on average beneficial. In a typical database search, at least one of these criteria is satisfied by over half the related sequence pairs. Compositional substitution matrix adjustment is now available in NCBI's protein-protein version of BLAST.

Figures

**Figure 1. ROC_n curves for the aravind103 and astral40 data sets using standard BLOSUM-62 and conditionally compositionally adjusted BLOSUM-62**
The BLAST program [25, 26, 29] was used to compare the test query sets to the test databases, with database sequences filtered of low-complexity segments using the SEG program [36] with parameters (10, 1.8, 2.1). Search results were pooled and ranked by E-value, and ROC_n curves [29, 34] were obtained by plotting true positives versus false positives for increasing E-values. For each test set, local alignment scores [9] were calculated using BLOSUM-62 substitution scores [13] and affine gap costs [40, 41]. Composition-based statistics [29] were employed in order to obtain accurate E-values. Specifically, for sufficiently high-scoring alignments, the BLOSUM-62 substitution scores were scaled to have an ungapped λ [10] of 0.006352 in the context of the two sequences being compared, and were used in conjunction with scores of −550 − 50k for a gap of length k. Gapped statistical parameters have been estimated for this scoring system using random simulation [42] and scaling arguments [26, 29]. Also, for each test set, a second run was performed with conditionally compositionally adjusted BLOSUM-62 substitution scores, constrained to have a relative entropy of 0.44 nats in the context of the two sequences being compared (mode C). (a) The aravind103 test set was compared to a yeast protein sequence database that had been edited to remove extra copies of highly similar sequences [29]. (b) A subset of 3586 sequences from the astral40 data set [30, 31] was used as queries against astral40; all self-comparisons were excluded.
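
The pooling-and-ranking step described in the caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `roc_n_curve` and `roc_n_score`, the `(e_value, is_true_positive)` input format, and the example hits are all hypothetical. The normalized score follows the commonly used ROC_n definition (the mean number of true positives ranked ahead of each of the first n false positives, divided by the total number of true positives), which is assumed here rather than taken from the caption.

```python
from typing import Iterable, List, Tuple

Hit = Tuple[float, bool]  # (E-value, is_true_positive); hypothetical input format


def roc_n_curve(hits: Iterable[Hit], n: int) -> List[Tuple[int, int]]:
    """Return (false_positives, true_positives) points up to n false positives."""
    if n <= 0:
        raise ValueError("n must be positive")
    ranked = sorted(hits, key=lambda h: h[0])  # pooled hits, ascending E-value
    points = [(0, 0)]
    tp = fp = 0
    for _, is_true in ranked:
        if is_true:
            tp += 1
        else:
            fp += 1
        points.append((fp, tp))
        if fp == n:  # the curve is truncated at the n-th false positive
            break
    return points


def roc_n_score(hits: Iterable[Hit], n: int, total_true: int) -> float:
    """Normalized area under the ROC_n curve, in [0, 1].

    Assumption: if fewer than n false positives are found, the missing ones
    are treated as ranked after every true positive that was found.
    """
    if n <= 0 or total_true <= 0:
        raise ValueError("n and total_true must be positive")
    ranked = sorted(hits, key=lambda h: h[0])
    tp = fp = area = 0
    for _, is_true in ranked:
        if is_true:
            tp += 1
        else:
            fp += 1
            area += tp  # true positives ranked ahead of this false positive
            if fp == n:
                break
    area += tp * (n - fp)  # pad when the hit list runs out early
    return area / (n * total_true)


# Made-up example: four true relationships, two false positives.
example_hits = [(1e-30, True), (1e-10, True), (1e-5, False),
                (0.01, True), (0.5, False), (2.0, True)]
print(roc_n_curve(example_hits, n=2))
print(roc_n_score(example_hits, n=2, total_true=4))
```

Truncating at n false positives keeps the comparison focused on the high-confidence end of the ranking, which is the region that matters in a database search.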

Comment in

Identifying protein interactions.
Appella E, Anderson CW. FEBS J. 2005 Oct;272(20):5099-100. doi: 10.1111/j.1742-4658.2005.04944.x. PMID: 16218943.

Cited by

Biocatalytic sulfation of aromatic and aliphatic alcohols catalyzed by arylsulfate sulfotransferases.
Oroz-Guinea I, Rath M, Tischler I, Ditrich K, Schachtschabel D, Breuer M, Kroutil W. Appl Microbiol Biotechnol. 2024 Nov 19;108(1):520. doi: 10.1007/s00253-024-13354-5. PMID: 39560778.
Exploring the Siderophore Portfolio for Mass Spectrometry-Based Diagnosis of Scedosporiosis and Lomentosporiosis.
Houšt' J, Palyzová A, Pluháček T, Novák J, Marešová H, Hubáček P, Dobiáš R, Stevens DA, Guegan H, Gangneux JP, Havlíček V. ACS Omega. 2024 Oct 23;9(44):44815-44824. doi: 10.1021/acsomega.4c08257. PMID: 39524635.
Expression and characterization of pantothenate energy-coupling factor transporters as an anti-infective drug target.
Shams A, Bousis S, Diamanti E, Elgaher WAM, Zeimetz L, Haupenthal J, Slotboom DJ, Hirsch AKH. Protein Sci. 2024 Nov;33(11):e5195. doi: 10.1002/pro.5195. PMID: 39473025.
Structure and dimerization properties of the plant-specific copper chaperone CCH.
Dluhosch D, Kersten LS, Schott-Verdugo S, Hoppen C, Schwarten M, Willbold D, Gohlke H, Groth G. Sci Rep. 2024 Aug 17;14(1):19099. doi: 10.1038/s41598-024-69532-y. PMID: 39154065.
Characterization and transmission of plasmid-mediated multidrug resistance in foodborne Vibrio parahaemolyticus.
Zhou H, Lu Z, Liu X, Bie X, Cui X, Wang Z, Sun X, Yang J. Front Microbiol. 2024 Jul 31;15:1437660. doi: 10.3389/fmicb.2024.1437660. PMID: 39144225.

References

1. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48:443–53.
2. McLachlan AD. Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c 551. J Mol Biol. 1971;61:409–24.
3. Dayhoff MO, Schwartz RM, Orcutt BC. A model of evolutionary change in proteins. In: Dayhoff MO, editor. Atlas of Protein Sequence and Structure. Washington, DC: Natl Biomed Res Found; 1978. pp. 345–52.
4. Schwartz RM, Dayhoff MO. Matrices for detecting distant relationships. In: Dayhoff MO, editor. Atlas of Protein Sequence and Structure. Washington, DC: Natl Biomed Res Found; 1978. pp. 353–58.
5. Feng DF, Johnson MS, Doolittle RF. Aligning amino acid sequences: comparison of commonly used methods. J Mol Evol. 1984;21:112–25.

Grants and funding

Z01 LM000072-10/LM/NLM NIH HHS/United States