DBP-PSSM: Combination of Evolutionary Profiles with the XGBoost Algorithm to Improve the Identification of DNA-binding Proteins

Yanping Zhang; Pengcheng Chen; Ya Gao; Jianwei Ni; Xiaosheng Wang

doi:10.2174/1386207323999201124203531

Abstract

Background and Objective: DNA-binding proteins play important roles in a variety of biological processes, such as gene transcription and regulation, DNA replication and repair, DNA recombination and packaging, and the formation of chromatin and ribosomes. A computational method that identifies DNA-binding proteins more efficiently is therefore needed.

Methods: We proposed a novel method, DBP-PSSM, which constructs features from the amino acid composition and evolutionary information of protein sequences. Maximum relevance, minimum redundancy (mRMR) feature selection was employed to choose the optimal features for an XGBoost classifier. The resulting model for predicting DNA-binding proteins, DBP-PSSM, was established with 5-fold cross-validation on the training dataset.
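The abstract does not spell out the feature definitions, but the keyword PSSM400 suggests a descriptor of the following kind. This is a minimal sketch, assuming one common form of PSSM-400: for each of the 20 standard amino acids, the PSSM rows of all positions holding that residue are summed, and the resulting 20 x 20 block is divided by the sequence length. The function name `pssm400`, the length normalization, and the handling of non-standard residues are assumptions for illustration, not the authors' code.

```python
# Column order used by PSI-BLAST ASCII PSSM files.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm400(sequence, pssm):
    """Return a 400-dimensional PSSM-400 style vector (assumed definition).

    sequence: protein sequence as a string of one-letter codes.
    pssm:     list of rows, one per residue, each with 20 scores.
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    if not sequence:
        raise ValueError("sequence must not be empty")

    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    totals = [[0.0] * 20 for _ in range(20)]

    for residue, row in zip(sequence, pssm):
        i = index.get(residue)
        if i is None:
            # Assumption: non-standard residues (e.g. X) are skipped.
            continue
        for j, score in enumerate(row):
            totals[i][j] += score

    # Assumption: normalize by sequence length so proteins are comparable.
    length = len(sequence)
    return [value / length for block in totals for value in block]


if __name__ == "__main__":
    # Toy input, not real PSI-BLAST output.
    toy_sequence = "AAC"
    toy_pssm = [[1.0] * 20, [2.0] * 20, [3.0] * 20]
    features = pssm400(toy_sequence, toy_pssm)
    print(len(features))
```

Vectors like this one, possibly concatenated with other descriptors, would then be passed to feature selection and the classifier.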

Results: DBP-PSSM achieved an accuracy of 81.18% and an MCC of 0.657 on a test dataset, outperforming many existing methods. These results demonstrate that our method can effectively predict DNA-binding proteins.
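For readers less familiar with the two reported metrics, the sketch below computes accuracy and the Matthews correlation coefficient (MCC) from a binary confusion matrix using the standard formulas. The helper name `accuracy_and_mcc` and the example counts are hypothetical; they are not the confusion matrix behind the figures reported above.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Compute accuracy and MCC from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    accuracy = (tp + tn) / total

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is taken as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, mcc


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the formulas.
    acc, mcc = accuracy_and_mcc(tp=80, tn=70, fp=30, fn=20)
    print(f"accuracy={acc:.4f}, MCC={mcc:.4f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes are unbalanced.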

Conclusion: The data and source code are provided at https://github.com/784221489/DNA-binding.

Keywords: DNA-binding proteins, Local_DPP, PSSM400, sliding window and smoothing window, mRMR, XGBoost.




DOI: https://dx.doi.org/10.2174/1386207323999201124203531
Print ISSN: 1386-2073
Online ISSN: 1875-5402
Publisher: Bentham Science Publishers

Combinatorial Chemistry & High Throughput Screening
