Abstract
Background: Hotspots are the residues that contribute most of the binding free energy in protein-protein interactions, and protein function frequently depends on them. At present, hotspot residues are typically identified by alanine scanning mutagenesis, which is costly, time-consuming and laborious.
Objective: More accurate and efficient methods are therefore needed to identify protein hotspot residues.
Methods: This paper proposed a novel encoding schema based on sequence-segment neighbors and constructed a random forest-based model to identify hotspots in protein interaction interfaces. First, 10 amino acid physicochemical properties, 16 features related to PI and DI, and 25 features related to ASA were extracted. Unlike previous residue encoding schemas, such as the autocorrelation descriptor or triplet combination information, the proposed schema accounts for the influence on a hotspot residue of both its neighboring amino acids and amino acids at a certain sequence distance from it.
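The neighbor-based encoding idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `encode_with_neighbors`, the `half_window` parameter, the zero padding at sequence ends, and the small property table are all assumptions made for illustration. It simply concatenates one per-residue property over a sliding window centered on the target residue, producing a fixed-length vector of the kind a random forest classifier could take as input.

```python
from typing import Dict, List

# Hypothetical property table for a few residues (illustrative values only,
# not the 10 physicochemical properties used in the paper).
EXAMPLE_PROPERTY: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "G": -0.4, "L": 3.8, "K": -3.9,
}


def encode_with_neighbors(
    sequence: str,
    index: int,
    half_window: int,
    table: Dict[str, float],
    pad: float = 0.0,
) -> List[float]:
    """Encode the residue at `index` together with its sequence neighbors.

    The window spans `half_window` residues on each side of the target.
    Positions that fall outside the sequence, and residues missing from
    `table`, are filled with `pad` (an assumption of this sketch).
    """
    features: List[float] = []
    for offset in range(-half_window, half_window + 1):
        pos = index + offset
        if 0 <= pos < len(sequence):
            features.append(table.get(sequence[pos], pad))
        else:
            features.append(pad)
    return features


if __name__ == "__main__":
    # Encode the central residue of a short toy sequence with two neighbors per side.
    print(encode_with_neighbors("ARGLK", 2, 2, EXAMPLE_PROPERTY))
```

Each additional property would contribute another window of values, so the feature vector length grows with both the number of properties and the window size.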
Results: The proposed model was compared with other hotspot prediction methods, including APIS, Robetta, FOLDEF, KFC and MINERVA.
Conclusion: The experimental results showed that the proposed model improves the prediction of protein hotspot residues relative to these methods on the same test set.
Keywords: Protein interaction, hotspots, encoding of sequence-segment neighbors, sliding window, random forest, schema.