Identification of DNA-Binding Proteins via Hypergraph Based Laplacian Support Vector Machine

Yuqing Qian; Hao Meng; Weizhong Lu; Zhijun Liao; Yijie Ding; Hongjie Wu

doi:10.2174/1574893616666210806091922

Abstract

Background: The identification of DNA-binding proteins (DBPs) is an important research field. Experimental methods for detecting DBPs are time-consuming and labor-intensive.

Objective: Machine learning methods have been proposed for large-scale DBP identification, but their predictive accuracy remains insufficient. Our aim is to develop a sequence-based machine learning model to predict DBPs.

Methods: We extracted six types of features (NMBAC, GE, MCD, PSSM-AB, PSSM-DWT, and PsePSSM) from protein sequences. We used Multiple Kernel Learning based on the Hilbert-Schmidt Independence Criterion (MKL-HSIC) to estimate the optimal kernel. We then constructed a hypergraph model to describe the relationship between labeled and unlabeled samples. Finally, a Laplacian Support Vector Machine (LapSVM) was employed to train the predictive model. The method was tested on the PDB186, PDB1075, PDB2272, and PDB14189 data sets.
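
As a rough illustration of the hypergraph step described in the Methods, the sketch below builds a hypergraph over labeled and unlabeled samples and computes its Laplacian. It is not the authors' implementation. The abstract does not say how hyperedges are formed or which Laplacian is used, so the k-nearest-neighbor hyperedges, the unit hyperedge weights, and the normalized Laplacian I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} are all assumptions, as are the function names and the parameter `k`.

```python
import numpy as np


def knn_hypergraph_incidence(X, k=3):
    """Return an n x n incidence matrix H (hypothetical helper).

    Assumption: hyperedge j contains sample j and its k nearest
    neighbours in Euclidean distance.
    """
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    # Pairwise squared Euclidean distances.
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Force each sample to rank first in its own hyperedge.
    np.fill_diagonal(d2, -np.inf)
    H = np.zeros((n, n))
    for j in range(n):
        members = np.argsort(d2[j])[: k + 1]
        H[members, j] = 1.0
    return H


def hypergraph_laplacian(H, w=None):
    """Normalized hypergraph Laplacian I - Dv^-1/2 H W De^-1 H^T Dv^-1/2."""
    n_vertices, n_edges = H.shape
    # Assumption: unit hyperedge weights unless the caller supplies w.
    w = np.ones(n_edges) if w is None else np.asarray(w, dtype=float)
    dv = H @ w           # weighted vertex degrees
    de = H.sum(axis=0)   # number of vertices in each hyperedge
    dv_inv_sqrt = 1.0 / np.sqrt(dv)
    left = dv_inv_sqrt[:, None] * H        # Dv^-1/2 H
    scaled = left * (w / de)[None, :]      # ... W De^-1
    theta = scaled @ left.T                # ... H^T Dv^-1/2
    return np.eye(n_vertices) - theta


# Toy data: 8 samples with 4 features, labeled and unlabeled pooled together.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(8, 4))
H_demo = knn_hypergraph_incidence(X_demo, k=3)
L_demo = hypergraph_laplacian(H_demo)
```

In a LapSVM-style model, a Laplacian of this kind usually enters as a smoothness penalty of the form f^T L f over the predictions on all samples, so that samples sharing hyperedges are pushed toward similar outputs. How the paper combines it with the MKL-HSIC kernel is not stated in the abstract.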

Results: Compared with other methods, our model achieved the best results on the benchmark data sets.

Conclusion: Accuracies of 87.1% and 74.2% were achieved on PDB186 (the independent test set for PDB1075) and PDB2272 (the independent test set for PDB14189), respectively.

Keywords: DNA-binding proteins, feature extraction, Laplacian support vector machine, multiple kernel learning, hypergraph learning, PDB.




Journal: Current Bioinformatics
Publisher: Bentham Science Publisher
DOI: https://dx.doi.org/10.2174/1574893616666210806091922
Print ISSN: 1574-8936
Online ISSN: 2212-392X
