EpiSemble: A Novel Ensemble-based Machine-learning Framework for
Prediction of DNA N6-methyladenine Sites Using Hybrid Features
Selection Approach for Crops

Dipro Sinha; Tanwy Dasmandal; Md Yeasin; Dwijesh C. Mishra; Anil Rai; Sunil Archak

doi:10.2174/1574893618666230316151648

Abstract

Aim: The study aimed to develop a robust and more precise 6mA methylation prediction tool that assists researchers in studying the epigenetic behaviour of crop plants.

Background: N6-methyladenine (6mA) is one of the predominant epigenetic modifications involved in a variety of biological processes in all three kingdoms of life. While in vitro approaches are more precise in detecting epigenetic alterations, they are resource-intensive and time-consuming. Artificial intelligence-based in silico methods have helped overcome these bottlenecks.

Methods: A novel machine learning framework was developed by combining four techniques: ensemble machine learning, a hybrid approach to feature selection, the addition of features such as the Average Mutual Information Profile (AMIP), and bootstrap sampling. Four feature sets, namely di-nucleotide frequency, GC content, AMIP, and nucleotide chemical properties, were chosen for the vectorization of DNA sequences. Nine machine learning models (support vector machine, random forest, k-nearest neighbor, artificial neural network, multiple logistic regression, decision tree, naïve Bayes, AdaBoost, and gradient boosting) were trained on the relevant features extracted by the feature selection module. The three best-performing models were then selected and combined into a robust ensemble model to predict sequences with 6mA sites.
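The vectorization step described above can be illustrated with a minimal Python sketch. It is not the EpiSemble implementation: the function names, the `max_lag` parameter, and the order in which the feature sets are concatenated are hypothetical. The nucleotide chemical property encoding (ring structure, functional group, hydrogen bonding) is the one commonly used in the 6mA literature and is assumed here, and the AMIP entries follow the usual definition of average mutual information between bases separated by a lag k. Sequences are assumed to be non-empty and to contain only A, C, G, and T.

```python
from collections import Counter
from itertools import product
from math import log2

BASES = "ACGT"
# All 16 dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]

# Assumed encoding: (ring structure, functional group, hydrogen bonding).
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 0, 1)}


def dinucleotide_frequency(seq):
    """Relative frequency of each overlapping dinucleotide (16 values)."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return [counts[d] / total for d in DINUCLEOTIDES]


def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def average_mutual_information(seq, lag):
    """Mutual information (in bits) between bases that are `lag` positions apart."""
    n_pairs = len(seq) - lag
    if n_pairs <= 0:
        return 0.0
    single = Counter(seq)
    pairs = Counter((seq[i], seq[i + lag]) for i in range(n_pairs))
    ami = 0.0
    for (x, y), count in pairs.items():
        p_xy = count / n_pairs
        p_x = single[x] / len(seq)
        p_y = single[y] / len(seq)
        ami += p_xy * log2(p_xy / (p_x * p_y))
    return ami


def amip(seq, max_lag):
    """Average mutual information profile for lags 1..max_lag."""
    return [average_mutual_information(seq, k) for k in range(1, max_lag + 1)]


def ncp_encoding(seq):
    """Three chemical-property bits per base, flattened into one list."""
    return [bit for base in seq for bit in NCP[base]]


def vectorize(seq, max_lag=3):
    """Concatenate the four feature sets into one numeric vector."""
    seq = seq.upper()
    return (dinucleotide_frequency(seq) + [gc_content(seq)]
            + amip(seq, max_lag) + ncp_encoding(seq))


features = vectorize("ACGT")
print(len(features))
```

For a sequence of length n, this sketch yields 16 + 1 + `max_lag` + 3n values, which a feature selection step would then reduce before model training.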

Results: EpiSemble, a novel ensemble model, was developed for the prediction of 6mA methylation sites. Using the new model, an improvement in accuracy of 7.0%, 3.74%, and 6.65% was achieved over existing models for the RiceChen, RiceLv, and Arabidopsis datasets, respectively. An R package, EpiSemble, based on the new model was developed and made available at https://cran.r-project.org/web/packages/EpiSemble/index.html.

Conclusion: The EpiSemble model added AMIP as a novel feature, integrated feature selection modules, bootstrapping of samples, and ensemble technique to achieve an improved output for accurate prediction of 6mA sites in plants. To our knowledge, this is the first R package developed for predicting epigenetic sites of genomes in crop plants, which is expected to help plant researchers in their future explorations.
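The bootstrapping and ensemble steps can be sketched in Python as follows. The abstract does not state how the three selected models are combined, so majority voting is an assumption made here for illustration. The model names, predictions, and labels are made-up toy values, not results from the study, and the helper names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def accuracy(predicted, actual):
    """Fraction of positions where the prediction matches the label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def top_k_models(scores, k=3):
    """Names of the k models with the highest validation score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def majority_vote(predictions_by_model, selected):
    """Combine the selected models' predictions by per-sample majority vote."""
    n_samples = len(predictions_by_model[selected[0]])
    voted = []
    for i in range(n_samples):
        votes = Counter(predictions_by_model[name][i] for name in selected)
        voted.append(votes.most_common(1)[0][0])
    return voted


# Toy validation labels (1 = 6mA site, 0 = non-site) and toy model outputs.
labels = [1, 0, 1, 1, 0]
predictions = {
    "svm": [1, 0, 1, 0, 0],
    "random_forest": [1, 0, 0, 1, 0],
    "gradient_boosting": [1, 1, 1, 1, 0],
    "naive_bayes": [0, 1, 0, 1, 1],
}

# Bootstrap resampling of training indices (shown only to illustrate the idea).
rng = random.Random(0)
resampled_indices = bootstrap_sample(list(range(len(labels))), rng)

scores = {name: accuracy(pred, labels) for name, pred in predictions.items()}
selected = top_k_models(scores, k=3)
ensemble = majority_vote(predictions, selected)
print(selected, accuracy(ensemble, labels))
```

With three voters and binary labels a tie cannot occur, which is one practical reason to combine an odd number of models.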




DOI: https://dx.doi.org/10.2174/1574893618666230316151648
Print ISSN: 1574-8936
Online ISSN: 2212-392X
Publisher Name: Bentham Science Publisher

Current Bioinformatics
