Current Bioinformatics

ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

EpiSemble: A Novel Ensemble-based Machine-learning Framework for Prediction of DNA N6-methyladenine Sites Using Hybrid Features Selection Approach for Crops

Author(s): Dipro Sinha, Tanwy Dasmandal, Md Yeasin, Dwijesh C. Mishra, Anil Rai and Sunil Archak*

Volume 18, Issue 7, 2023

Published on: 30 May, 2023

Page: [587 - 597] Pages: 11

DOI: 10.2174/1574893618666230316151648

Abstract

Aim: The study aimed to develop a robust and more precise 6mA methylation prediction tool that assists researchers in studying the epigenetic behaviour of crop plants.

Background: N6-methyladenine (6mA) is one of the predominant epigenetic modifications involved in a variety of biological processes in all three kingdoms of life. While in vitro approaches are more precise in detecting epigenetic alterations, they are resource-intensive and time-consuming. Artificial intelligence-based in silico methods have helped overcome these bottlenecks.

Methods: A novel machine learning framework was developed by combining four techniques: ensemble machine learning, a hybrid approach to feature selection, the addition of features such as the Average Mutual Information Profile (AMIP), and bootstrap sampling. In this study, four feature sets, namely di-nucleotide frequency, GC content, AMIP, and nucleotide chemical properties, were chosen for the vectorization of DNA sequences. Nine machine learning models (support vector machine, random forest, k-nearest neighbor, artificial neural network, multiple logistic regression, decision tree, naïve Bayes, AdaBoost, and gradient boosting) were employed using the relevant features extracted through the feature selection module. The three best-performing models were then selected and combined into a robust ensemble model to predict sequences with 6mA sites.
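To make the vectorization step concrete, the following is a minimal Python sketch of three of the four feature sets named above: di-nucleotide frequency, GC content, and an average mutual information profile. It is an illustration under stated assumptions, not the authors' implementation and not the API of the EpiSemble R package. The function names, the `max_lag` parameter, and the choice to estimate marginal base frequencies from the same lagged pairs are all assumptions made here; nucleotide chemical properties are left out to keep the sketch short.

```python
from itertools import product
from math import log2

BASES = "ACGT"
# The 16 dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(pair) for pair in product(BASES, repeat=2)]


def dinucleotide_frequency(seq):
    """Relative frequency of each overlapping dinucleotide (16 values)."""
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skips pairs with ambiguous bases such as N
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases."""
    seq = seq.upper()
    valid = sum(seq.count(b) for b in BASES)
    return (seq.count("G") + seq.count("C")) / valid if valid else 0.0


def amip(seq, max_lag=3):
    """Mutual information (in bits) between bases k apart, for k = 1..max_lag.

    Assumption: marginals are estimated from the lagged pairs themselves,
    which may differ from the estimator used in the paper.
    """
    seq = seq.upper()
    profile = []
    for k in range(1, max_lag + 1):
        pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)
                 if seq[i] in BASES and seq[i + k] in BASES]
        n = len(pairs)
        joint, left, right = {}, {}, {}
        for a, b in pairs:
            joint[(a, b)] = joint.get((a, b), 0) + 1
            left[a] = left.get(a, 0) + 1
            right[b] = right.get(b, 0) + 1
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written with raw counts
        mi = sum((c / n) * log2(c * n / (left[a] * right[b]))
                 for (a, b), c in joint.items())
        profile.append(mi)
    return profile


def vectorize(seq, max_lag=3):
    """Concatenate the three feature sets into one numeric vector."""
    return dinucleotide_frequency(seq) + [gc_content(seq)] + amip(seq, max_lag)
```

Calling `vectorize(sequence)` on a DNA window returns a list of 16 + 1 + `max_lag` numbers, which could then be passed to a feature selection step and to any of the classifiers listed above.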

Results: EpiSemble, a novel ensemble model, was developed for the prediction of 6mA methylation sites. The new model improved accuracy over existing models by 7.0%, 3.74%, and 6.65% on the RiceChen, RiceLv, and Arabidopsis datasets, respectively. An R package, EpiSemble, based on the new model was developed and made available at https://cran.r-project.org/web/packages/EpiSemble/index.html.

Conclusion: The EpiSemble model adds AMIP as a novel feature and combines feature selection modules, bootstrapping of samples, and an ensemble technique to predict 6mA sites in plants more accurately. To our knowledge, this is the first R package developed for predicting epigenetic sites of genomes in crop plants, and it is expected to help plant researchers in their future explorations.
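The combination of bootstrapping, model ranking, and ensembling summarized above can be sketched in a few lines of Python. This is a hedged illustration only: the abstract does not say how the bootstrap samples enter the pipeline or how the three selected models are combined, so the bootstrap-averaged accuracy used for ranking and the majority vote used for combining are assumptions. The toy threshold "models", the function names, and the data are hypothetical stand-ins for fitted classifiers.

```python
import random
from collections import Counter


def bootstrap_accuracy(predict, X, y, n_rounds=50, seed=0):
    """Mean accuracy of `predict` over resamples drawn with replacement."""
    rng = random.Random(seed)
    indices = range(len(y))
    scores = []
    for _ in range(n_rounds):
        sample = [rng.choice(indices) for _ in indices]
        hits = sum(predict(X[i]) == y[i] for i in sample)
        scores.append(hits / len(sample))
    return sum(scores) / len(scores)


def select_top_models(models, X, y, k=3):
    """Keep the k models with the highest bootstrap-averaged accuracy."""
    ranked = sorted(models,
                    key=lambda name: bootstrap_accuracy(models[name], X, y),
                    reverse=True)
    return {name: models[name] for name in ranked[:k]}


def majority_vote(models, x):
    """Label predicted by most of the selected models."""
    votes = Counter(predict(x) for predict in models.values())
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for fitted classifiers: each maps a single
# numeric feature to a 0/1 label (1 = sequence predicted to carry 6mA).
toy_models = {
    "always_positive": lambda x: 1,
    "always_negative": lambda x: 0,
    "threshold_0.4": lambda x: int(x > 0.4),
    "threshold_0.5": lambda x: int(x > 0.5),
    "threshold_0.6": lambda x: int(x > 0.6),
}
X_val = [0.1, 0.2, 0.3, 0.45, 0.55, 0.7, 0.8, 0.9]
y_val = [0, 0, 0, 0, 1, 1, 1, 1]

top_three = select_top_models(toy_models, X_val, y_val, k=3)
ensemble_predictions = [majority_vote(top_three, x) for x in X_val]
```

With an odd number of voters and two classes, the vote cannot tie. Probability averaging or weighted voting would be reasonable alternatives when the underlying classifiers expose class probabilities.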
