Current Bioinformatics

ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

m5C-HPromoter: An Ensemble Deep Learning Predictor for Identifying 5-methylcytosine Sites in Human Promoters

Author(s): Xuan Xiao*, Yu-Tao Shao, Zhen-Tao Luo and Wang-Ren Qiu*

Volume 17, Issue 5, 2022

Published on: 02 June, 2022

Page: [452 - 461] Pages: 10

DOI: 10.2174/1574893617666220330150259


Abstract

Aims: This study aims to identify 5-methylcytosine sites in human promoters.

Background: Aberrant DNA methylation patterns are often associated with tumor development. Hypermethylation inhibits the expression of tumor suppressor genes, whereas hypomethylation stimulates the expression of certain oncogenes. Most DNA methylation occurs in the CpG islands of gene promoter regions.

Objective: A comprehensive assessment of the methylation status of human gene promoter regions is therefore extremely important for understanding cancer pathogenesis and the function of posttranscriptional modification.

Methods: This paper constructed three human promoter methylation datasets, which comprise a total of 3 million sample sequences of small cell lung cancer, non-small cell lung cancer, and hepatocellular carcinoma from the Cancer Cell Line Encyclopedia (CCLE) database. Frequency-based One-Hot encoding was used to encode the sample sequences, and a stacking-based ensemble deep learning classifier was applied to build the m5C-HPromoter predictor.
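
For context, the plain One-Hot baseline for DNA sequences can be sketched as follows. This is a minimal illustration, not the authors' encoder: the abstract does not define the frequency-based scheme, so it is not reproduced here, and the function name `one_hot`, the toy sequence, and the all-zero handling of unknown symbols are assumptions of this sketch. It shows why the plain representation is long and sparse: every nucleotide expands to four values, three of which are zero.

```python
# Plain One-Hot encoding of a DNA sequence.
# Baseline illustration only; NOT the paper's frequency-based variant.
ALPHABET = "ACGT"


def one_hot(seq):
    """Return a flat list of 4 * len(seq) values, one 4-value block per nucleotide."""
    vec = []
    for base in seq.upper():
        # Symbols outside ACGT (e.g. N) become an all-zero block (assumption).
        vec.extend(1 if base == letter else 0 for letter in ALPHABET)
    return vec


example = "ACGTC"  # hypothetical toy sequence
encoded = one_hot(example)
sparsity = encoded.count(0) / len(encoded)  # fraction of zero-valued features
print(len(encoded), sparsity)
```

For a sequence of length L made up only of A, C, G, and T, the vector has 4L entries with exactly L ones, so three quarters of the features are zero regardless of the sequence content.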

Results: Averaged over 10 runs of 5-fold cross-validation, m5C-HPromoter achieved Accuracy (Acc) = 0.9270, Matthews correlation coefficient (MCC) = 0.7234, Sensitivity (Sn) = 0.9123, and Specificity (Sp) = 0.9290.

Conclusion: Numerical experiments showed that the proposed m5C-HPromoter substantially improves prediction performance over the existing iPromoter-5mC predictor. The primary reason is that frequency-based One-Hot encoding avoids the overly long and sparse feature vectors of plain One-Hot encoding while effectively reflecting the sequence features of DNA. The second reason is that combining upsampling with downsampling successfully addresses the class imbalance problem. The third reason is that the stacking-based ensemble deep learning model combines the strengths of its component models while offsetting their individual shortcomings. The user-friendly m5C-HPromoter web server is freely accessible at http://121.36.221.79/m5C-HPromoter or http://bioinfo.jcu.edu.cn/m5C-HPromoter, and the predictor program is available at https://github.com/liujin66/m5C-HPromoter.
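
The idea of combining upsampling and downsampling can be illustrated with plain random resampling. The abstract does not state which resampling methods the authors used, so this is only a sketch under that assumption; the function name `rebalance`, the `target` size, and the toy sample labels are all hypothetical.

```python
import random


def rebalance(majority, minority, target, seed=0):
    """Resample both classes to `target` items each.

    The majority class is randomly downsampled without replacement and the
    minority class is randomly upsampled by duplicating existing items.
    Requires len(minority) <= target <= len(majority).
    """
    if not len(minority) <= target <= len(majority):
        raise ValueError("target must lie between the two class sizes")
    rng = random.Random(seed)
    down = rng.sample(majority, target)
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    up = list(minority) + extra
    return down, up


# Hypothetical toy data: 100 majority-class and 10 minority-class samples.
majority_samples = [f"maj{i}" for i in range(100)]
minority_samples = [f"min{i}" for i in range(10)]
down, up = rebalance(majority_samples, minority_samples, target=40)
print(len(down), len(up))
```

Meeting in the middle limits both the information discarded from the majority class and the number of duplicated minority samples, compared with using either strategy alone.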

Keywords: 5-methylcytosine, human promoters, frequency-based One-Hot encoding, deep neural network, ensemble deep learning, DNA methylation.



© 2024 Bentham Science Publishers