Abstract
Background and Objective: DNA-binding proteins play important roles in a variety of biological processes, such as gene transcription and regulation, DNA replication and repair, DNA recombination and packaging, and the formation of chromatin and ribosomes. Computational methods that recognize DNA-binding proteins more efficiently are therefore urgently needed.
Methods: We propose a novel method, DBP-PSSM, which constructs features from the amino acid composition and evolutionary information of protein sequences. The maximum relevance, minimum redundancy (mRMR) method was used to select the optimal features, on which an XGBoost classifier was trained. The resulting model for predicting DNA-binding proteins, DBP-PSSM, was established with 5-fold cross-validation on the training dataset.
Results: DBP-PSSM achieved an accuracy of 81.18% and a Matthews correlation coefficient (MCC) of 0.657 on a test dataset, outperforming many existing methods. These results demonstrate that our method can effectively predict DNA-binding proteins.
Conclusion: The data and source code are provided at https://github.com/784221489/DNA-binding.
Keywords: DNA-binding proteins, Local_DPP, PSSM400, sliding window and smoothing window, mRMR, XGBoost.
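For readers unfamiliar with the two metrics reported in the Results, the following minimal sketch shows how accuracy and MCC are conventionally computed from binary predictions. It is illustrative only: the function names and the toy labels are assumptions made for this example and are not taken from the DBP-PSSM source code or datasets.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = DNA-binding, 0 = non-binding)."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy labels, not data from the paper.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))
print("MCC:", mcc(*counts))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when the two classes are imbalanced.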