Abstract
Introduction: Transcription factors are of great interest in biotechnology due to their key role in the regulation of gene expression. One of the most important transcription factors in Gram-negative bacteria is Fur, a global regulator studied as a therapeutic target for the design of antibacterial agents. Its DNA-binding domain, which contains a helix-turn-helix motif, is one of its most relevant features.
Methods: In this study, we evaluated several machine learning algorithms, including Support Vector Machines (SVM), Random Forest (RF), Decision Trees (DT), and Naive Bayes (NB), for the prediction of DNA-binding sites based on proteins from the Fur superfamily and other helix-turn-helix transcription factors. We also tested the efficacy of several molecular descriptors derived from the amino acid sequence and the structure of the protein fragments that bind the DNA. A feature selection procedure was employed to reduce the number of descriptors in each case while maintaining good classification performance.
Results: The best results were obtained with the SVM model using twelve sequence-derived attributes and the DT model using nine structure-derived features, achieving 82% and 76% accuracy, respectively.
Conclusion: The performance obtained indicates that the descriptors we used are relevant for predicting DNA-binding sites since they can discriminate between binding and non-binding regions of a protein.
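As an illustration of what sequence-derived descriptors of a protein fragment can look like, the following minimal Python sketch computes four simple attributes from an amino acid string. The abstract does not list the actual descriptors used, so the choice of attributes here (length, mean Kyte-Doolittle hydropathy, a crude net charge, and aromatic fraction) and the function name `sequence_descriptors` are assumptions made for illustration only, not the study's feature set.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_descriptors(fragment):
    """Return a dict of simple sequence-derived descriptors.

    Hypothetical helper: the descriptor set is illustrative and is
    not taken from the study summarized above.
    """
    seq = fragment.upper()
    if not seq or any(aa not in KD_HYDROPATHY for aa in seq):
        raise ValueError("fragment must contain only the 20 standard amino acids")
    n = len(seq)
    positive = sum(aa in "KR" for aa in seq)  # Lys, Arg
    negative = sum(aa in "DE" for aa in seq)  # Asp, Glu
    return {
        "length": float(n),
        "mean_hydropathy": sum(KD_HYDROPATHY[aa] for aa in seq) / n,
        # Crude net charge: counts of basic minus acidic residues (His ignored).
        "net_charge": float(positive - negative),
        "aromatic_fraction": sum(aa in "FWY" for aa in seq) / n,
    }
```

A vector of such values per fragment, labeled as binding or non-binding, is the kind of input a classifier such as an SVM or a decision tree would be trained on; a feature selection step would then keep only the attributes that contribute to the classification.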