Abstract
Background: Identifying DNA-binding proteins (DBPs) is an important research problem. Experimental methods for detecting DBPs are time-consuming and labor-intensive.
Objective: Machine learning methods have been proposed for large-scale DBP identification, but their predictive accuracy remains insufficient. Our aim is to develop a sequence-based machine learning model to predict DBPs.
Methods: We extracted six types of features (NMBAC, GE, MCD, PSSM-AB, PSSM-DWT, and PsePSSM) from protein sequences. We used Multiple Kernel Learning based on the Hilbert-Schmidt Independence Criterion (MKL-HSIC) to estimate the optimal kernel. We then constructed a hypergraph model to describe the relationship between labeled and unlabeled samples. Finally, a Laplacian Support Vector Machine (LapSVM) was employed to train the predictive model. The method was tested on the PDB186, PDB1075, PDB2272, and PDB14189 data sets.
Results: Compared with other methods, our model achieved the best results on the benchmark data sets.
Conclusion: Accuracies of 87.1% and 74.2% were achieved on PDB186 (the independent test set for PDB1075) and PDB2272 (the independent test set for PDB14189), respectively.
Keywords: DNA-binding proteins, feature extraction, Laplacian support vector machine, multiple kernel learning, hypergraph learning, PDB.
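To make the kernel-combination step in the Methods more concrete, the following is a minimal sketch of how several feature-specific kernels might be weighted by their HSIC with the labels. It is not the MKL-HSIC optimization used in this work: the weighting rule (weights proportional to the empirical HSIC score) is a simplifying assumption, and the names `rbf_kernel`, `hsic`, and `combine_kernels`, as well as the choice of an RBF kernel with parameter `gamma`, are hypothetical and introduced only for illustration.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix for the rows of X (kernel choice is an assumption)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def hsic(K, L):
    """Biased empirical HSIC estimate: trace(K H L H) / (n - 1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def combine_kernels(kernels, y):
    """Weight each kernel by its HSIC with the label kernel y y^T.

    Simplified heuristic for illustration only; it assumes at least one
    kernel has a nonzero HSIC score, and it is not the MKL-HSIC solver.
    """
    L = np.outer(y, y)  # ideal label kernel for labels in {-1, +1}
    scores = np.array([max(hsic(K, L), 0.0) for K in kernels])
    weights = scores / scores.sum()
    K_combined = sum(w * K for w, K in zip(weights, kernels))
    return weights, K_combined
```

In this sketch, each of the six feature types would contribute one kernel matrix computed on the same set of proteins, and the combined kernel would then be passed to a kernel classifier. Kernels that align poorly with the labels receive small weights, which is the intuition behind using HSIC as the selection criterion.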