Abstract
Background: Knowledge of the quaternary structure attributes of proteins is important because these attributes are closely related to the proteins' biological functions. With the rapid development of next-generation sequencing technology, we face a challenge: how to automatically identify the quaternary structure attributes of new polypeptide chains from their sequence information (i.e., whether they form a monomer, a hetero-oligomer, or a homo-oligomer).
Objective: Our goal is to find a new way to represent protein sequences and thereby improve the prediction rate for protein quaternary structure.
Methods: We developed a prediction system for protein quaternary structure type in which a protein sequence is represented by combining Pfam functional-domain composition with gene ontology information. The protein features are converted into numerical vectors, and the quaternary structure is then predicted with a machine learning algorithm (random forest) and validated with the jackknife test.
Results: Our data set contains 5495 protein samples. With the method presented in this paper, we classify proteins as monomers, hetero-oligomers, or homo-oligomers with a prediction rate of 74.38%, which is 3.24% higher than that of previous studies. With this new feature extraction method, we can also further classify the quaternary structure of proteins, and those results are correspondingly improved.
Conclusion: The new prediction system improves the prediction rate over previous results. We have reason to believe that the feature extraction method in this paper is more practical and can serve as a reference for other protein classification problems.
Keywords: Protein quaternary structure, Pfam, functional domain composition, gene ontology, random forest algorithm, jackknife test.
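
The pipeline summarized in the Methods can be illustrated with a minimal sketch, written under several assumptions that are not taken from the paper. Each protein is assumed to come with a set of Pfam and gene ontology identifiers, which is turned into a binary presence/absence vector; the identifiers below (`PF_A`, `GO_1`, and so on) are made-up placeholders, not real annotations. To keep the example dependency-free, a simple nearest-neighbour rule stands in for the random forest classifier, so only the encoding step and the jackknife (leave-one-out) loop reflect the approach described; the function names are hypothetical.

```python
from typing import Callable, List, Sequence, Set

Vector = List[int]


def build_vocabulary(annotations: Sequence[Set[str]]) -> List[str]:
    """Sorted list of every Pfam/GO identifier seen in the data set."""
    return sorted(set().union(*annotations))


def encode(annotation: Set[str], vocabulary: Sequence[str]) -> Vector:
    """Binary vector: 1 if the protein carries the identifier, else 0."""
    return [1 if term in annotation else 0 for term in vocabulary]


def nearest_neighbour(train_x: List[Vector], train_y: List[str], query: Vector) -> str:
    """Stand-in classifier (NOT a random forest): label of the closest
    training vector by Hamming distance."""
    def hamming(a: Vector, b: Vector) -> int:
        return sum(x != y for x, y in zip(a, b))

    best = min(range(len(train_x)), key=lambda i: hamming(train_x[i], query))
    return train_y[best]


def jackknife_accuracy(
    x: List[Vector],
    y: List[str],
    classify: Callable[[List[Vector], List[str], Vector], str] = nearest_neighbour,
) -> float:
    """Jackknife test: hold out each sample once, train on all the others,
    and report the fraction of held-out samples predicted correctly."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if classify(train_x, train_y, x[i]) == y[i]:
            correct += 1
    return correct / len(x)


# Toy data with placeholder identifiers (assumption, for illustration only).
proteins = [
    ({"PF_A", "GO_1"}, "monomer"),
    ({"PF_A", "GO_1", "GO_2"}, "monomer"),
    ({"PF_B", "GO_3"}, "homo-oligomer"),
    ({"PF_B", "GO_3", "GO_4"}, "homo-oligomer"),
    ({"PF_C", "GO_5"}, "hetero-oligomer"),
    ({"PF_C", "GO_5", "GO_6"}, "hetero-oligomer"),
]

vocabulary = build_vocabulary([annotation for annotation, _ in proteins])
features = [encode(annotation, vocabulary) for annotation, _ in proteins]
labels = [label for _, label in proteins]

print(f"jackknife accuracy on toy data: {jackknife_accuracy(features, labels):.2f}")
```

Any classifier with the same `(train_x, train_y, query)` signature can be passed as `classify`, so a random forest implementation could be swapped in without changing the jackknife loop.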