Abstract
Background: N4-methylcytosine (4mC) is one of the most widespread DNA methylation modifications and plays an important role in DNA replication and repair, epigenetic inheritance, gene expression, and transcriptional regulation. Although biological experiments can identify potential 4mC modification sites, they are constrained by experimental conditions and are labor-intensive. Therefore, it is crucial to construct a computational model to identify 4mC sites.
Objective: Although several computational methods have been proposed to identify 4mC sites, some problems should not be ignored: (1) biological sequences contain a large number of unknown nucleotides; (2) previous encoding techniques produce a large number of zeros; (3) sequence distribution information is important for identifying 4mC sites. Considering these aspects, we propose a computational model based on a novel encoding strategy with position-specific information to identify 4mC sites.
Methods: We constructed an accurate computational model, i4mC-CPXG, based on extreme gradient boosting. Feature vectors are extracted from two aspects: nucleotide information and position-specific information. For nucleotide information, we used prior information to identify the base type of each unknown nucleotide and to reduce the influence of the invalid information caused by large numbers of zeros. For position-specific information, the vector was carefully designed to express base distribution and arrangement. The feature vector that fuses nucleotide information and position-specific information was then input into extreme gradient boosting to construct the model.
Results: The accuracy of i4mC-CPXG is 82.49% on the independent dataset. This result is better than that of i4mC-w2vec, which was the best model on the imbalanced dataset with a ratio of 1:15. Our model also achieved good performance on other species. These results validate the effectiveness of i4mC-CPXG.
Conclusion: Our method is effective at identifying potential 4mC modification sites owing to the proposed encoding strategy fused with position-specific information. The satisfactory prediction results on balanced datasets, imbalanced datasets, and datasets from other species indicate that i4mC-CPXG can provide a valuable supplement to biological research.
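The encoding idea summarized in the Methods can be illustrated with a minimal sketch. It is not the i4mC-CPXG encoding itself, whose exact definition is not given in this abstract. The sketch assumes that the "prior information" is the most frequent base at each position in the training sequences, that the nucleotide part is a dense index code (one non-zero value per position instead of a sparse one-hot vector), and that the position-specific part is the difference between positive-set and negative-set base frequencies. All function names and the toy sequences are hypothetical.

```python
from collections import Counter

BASES = "ACGT"

def position_counts(seqs):
    """Count known bases at each position across equal-length sequences."""
    counts = [Counter() for _ in range(len(seqs[0]))]
    for s in seqs:
        for i, b in enumerate(s):
            if b in BASES:  # unknown symbols such as 'N' are skipped
                counts[i][b] += 1
    return counts

def position_frequencies(counts):
    """Turn per-position counts into per-position base frequencies."""
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({b: (c[b] / total if total else 0.0) for b in BASES})
    return freqs

def impute_unknown(seq, counts):
    """Assumed prior: replace a non-ACGT symbol with the most frequent
    base observed at that position (ties resolve in ACGT order)."""
    return "".join(
        b if b in BASES else max(BASES, key=lambda x: counts[i][x])
        for i, b in enumerate(seq)
    )

def encode(seq, all_counts, pos_freqs, neg_freqs):
    """Fuse a dense nucleotide code with position-specific frequencies."""
    seq = impute_unknown(seq, all_counts)
    # Nucleotide part: one non-zero value per position (hypothetical code).
    nucleotide = [(BASES.index(b) + 1) / len(BASES) for b in seq]
    # Position-specific part: positive minus negative frequency of the base.
    positional = [pos_freqs[i][b] - neg_freqs[i][b] for i, b in enumerate(seq)]
    return nucleotide + positional

# Toy data, invented for illustration only.
positives = ["ACGT", "ACGA", "ANGT"]
negatives = ["TTGA", "TCAA"]

all_counts = position_counts(positives + negatives)
pos_freqs = position_frequencies(position_counts(positives))
neg_freqs = position_frequencies(position_counts(negatives))

vector = encode("ANGT", all_counts, pos_freqs, neg_freqs)
```

Vectors built this way for every training sequence could then be passed, together with their labels, to a gradient boosting classifier such as `xgboost.XGBClassifier`. The classifier's hyperparameters are not stated in this abstract and would have to be chosen separately.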