Abstract
Background: DNA-binding proteins are essential to many biomolecular functions. Traditional experimental methods for identifying them are expensive and time-consuming, so computational methods that can predict whether a protein binds DNA are very helpful to researchers. Machine learning has been widely used in many research areas, and many researchers have proposed machine learning methods for DNA-binding protein prediction. This paper highlights the advantages and disadvantages of those methods.
Objective: Many computational methods can predict DNA-binding proteins, each using different features and different classification algorithms. This paper reviews these methods to identify common procedures that can help researchers develop more accurate ones.
Methods: First, the information stored in the protein sequence and gene sequence is presented; this information is the basis for finding the patterns that lead to binding. Then, feature extraction methods and classifier algorithms are discussed. Finally, some commonly used benchmark datasets are analysed and used to evaluate the methods.
Conclusion: In this review, we analyzed some popular computational methods for predicting DNA-binding proteins. From those methods, we highlighted many features necessary to build an accurate DNA-binding protein classifier, which can also help researchers build more useful computational tools. Currently, some machine learning methods perform well in predicting DNA-binding proteins, and their performance can be improved by using different kinds of features and classifiers.
Keywords: DNA-binding protein, machine learning, feature extraction, PseAAC, DWT, benchmark dataset.