i4mC-CPXG: A Computational Model for Identifying DNA N4-methylcytosine Sites in Rosaceae Genome Using Novel Encoding Strategy

Lichao Zhang; Ying Liang; Kang Xiao; Liang Kong

doi:10.2174/1574893618666221124095411

Abstract

Background: N4-methylcytosine (4mC) is one of the most widespread DNA methylation modifications and plays an important role in DNA replication and repair, epigenetic inheritance, gene expression levels, and regulation of transcription. Although biological experiments can identify potential 4mC modification sites, they are constrained by the experimental environment and are labor-intensive. Therefore, it is crucial to construct a computational model to identify 4mC sites.

Objective: Although some computational methods have been proposed to identify 4mC sites, several problems should not be ignored: (1) biological sequences contain a large number of unknown nucleotides; (2) previous encoding techniques produce a large number of zeros; (3) sequence distribution information is important for identifying 4mC sites. Considering these aspects, we propose a computational model based on a novel encoding strategy with position-specific information to identify 4mC sites.
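Problems (1) and (2) can be illustrated with a minimal sketch. It assumes one-hot encoding as a representative earlier scheme and assumes that unknown nucleotides (written `N` here) are mapped to an all-zero vector; the abstract names neither, so both choices, along with the function names and the toy sequence, are illustrative only.

```python
# Illustrative only: one-hot encoding is an assumed example of a "previous
# encoding"; the abstract does not say which schemes it refers to.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot(seq):
    """Concatenate 4-dim one-hot vectors; unknown bases become (0, 0, 0, 0)."""
    vec = []
    for base in seq.upper():
        vec.extend(ONE_HOT.get(base, (0, 0, 0, 0)))
    return vec


def zero_fraction(vec):
    """Share of entries in the encoded vector that are zero."""
    return vec.count(0) / len(vec)


seq = "ACGTNNCA"              # hypothetical sequence with two unknown bases
encoded = one_hot(seq)
print(len(encoded))            # 32
print(zero_fraction(encoded))  # 0.8125
```

Even without unknown bases, three of every four entries are zero under this scheme, and each unknown base adds four more zeros that carry no information.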

Methods: We constructed an accurate computational model, i4mC-CPXG, based on extreme gradient boosting. Feature vectors are extracted from two aspects: nucleotide information and position-specific information. For nucleotide information, we used prior information to identify the base type of each unknown nucleotide and to reduce the influence of invalid information caused by large numbers of zeros. For position-specific information, the vector was designed to express the base distribution and arrangement. The feature vector obtained by fusing nucleotide information and position-specific information was then input into extreme gradient boosting to construct the model.
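The abstract does not give the encoding formulas, so the following is only a rough sketch of the general idea, not the i4mC-CPXG encoding itself. It assumes that the "prior information" is the per-position base frequency in the training data, that an unknown base is replaced by the most frequent base at its position, and that the position-specific feature is the difference between positive-set and negative-set frequencies. All function names and toy sequences are hypothetical.

```python
from collections import Counter

BASES = "ACGT"


def position_frequencies(seqs):
    """Per-position relative frequency of A/C/G/T, ignoring unknown bases."""
    length = len(seqs[0])
    tables = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs if s[i] in BASES)
        total = sum(counts.values()) or 1  # avoid division by zero
        tables.append({b: counts[b] / total for b in BASES})
    return tables


def resolve_unknown(seq, tables):
    """Assumed prior: replace a non-ACGT symbol with the most frequent base
    observed at that position (ties resolved in A, C, G, T order)."""
    return "".join(
        b if b in BASES else max(BASES, key=lambda x: tables[i][x])
        for i, b in enumerate(seq)
    )


def encode(seq, pos_tables, neg_tables, prior_tables):
    """Dense position-specific feature: positive minus negative frequency
    of the base found at each position (an assumed formulation)."""
    seq = resolve_unknown(seq, prior_tables)
    return [pos_tables[i][b] - neg_tables[i][b] for i, b in enumerate(seq)]


# Toy training sequences, invented for illustration.
positives = ["ACCGT", "ACCGA", "TCCGT"]
negatives = ["GTCAA", "GACAT", "CTCAA"]

pos_t = position_frequencies(positives)
neg_t = position_frequencies(negatives)
all_t = position_frequencies(positives + negatives)

features = encode("NCCGT", pos_t, neg_t, all_t)
print(resolve_unknown("NCCGT", all_t))  # ACCGT
print(features)
```

A vector of this kind has one real-valued entry per position rather than a block of mostly zero entries. In practice such vectors could be passed to a gradient boosting classifier, for example `xgboost.XGBClassifier`; that choice of library is an assumption, as the abstract specifies only the algorithm.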

Results: The accuracy of i4mC-CPXG is 82.49% on the independent dataset. This result is better than that of i4mC-w2vec, which was the best model on the imbalanced dataset with a ratio of 1:15. Our model also achieved good performance on other species. These results validate the effectiveness of i4mC-CPXG.
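When reading results on a 1:15 imbalanced dataset, it can help to look at accuracy alongside sensitivity, specificity, and the Matthews correlation coefficient. The sketch below uses standard definitions of these metrics; the confusion-matrix counts are invented to show the effect of imbalance and are not results from the paper, which reports only the figures quoted above.

```python
import math


def metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical 1:15 dataset: 100 positives, 1500 negatives.
# A trivial model that predicts "negative" for every sample:
trivial = metrics(tp=0, tn=1500, fp=0, fn=100)
print(trivial["accuracy"])     # 0.9375
print(trivial["sensitivity"])  # 0.0
print(trivial["mcc"])          # 0.0
```

The trivial model reaches high accuracy while finding no 4mC sites at all, which is why accuracy on imbalanced data is usually interpreted together with the other metrics.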

Conclusion: Our method is effective at identifying potential 4mC modification sites owing to the proposed new encoding strategy fused with position-specific information. The satisfactory prediction results on balanced datasets, imbalanced datasets, and datasets from other species indicate that i4mC-CPXG can provide a valuable supplement to biological research.




DOI: https://dx.doi.org/10.2174/1574893618666221124095411
Print ISSN: 1574-8936
Online ISSN: 2212-392X
Publisher Name: Bentham Science Publisher

Current Bioinformatics
