Abstract
Background: Enhancers are key cis-regulatory DNA elements that are crucial to gene regulation and promoter function in eukaryotic cells. Accurate identification of enhancers would facilitate the understanding of DNA functions and their physiological roles. Previous studies have demonstrated the effectiveness of computational methods for identifying enhancers in other organisms. To date, however, a large number of enhancers remain unknown, especially in plant species.
Objective: This study aims to build an efficient attention-based neural network model for identifying Arabidopsis thaliana enhancers.
Methods: A sequence-based model using convolutional and recurrent neural networks was proposed for the identification of enhancers. The input DNA sequences were represented as feature vectors using 4-mers. The model uses a convolutional neural network (CNN) and a bidirectional recurrent neural network (Bi-RNN) as sequence feature extractors, and an attention mechanism was added to improve prediction performance.
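The 4-mer representation can be illustrated with a minimal sketch. It assumes overlapping 4-mers taken with a stride of 1 and mapped to integer ids, with id 0 reserved for 4-mers containing ambiguous bases; the abstract does not state these details, and the names `KMER_INDEX`, `kmer_tokens`, and `encode` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"
K = 4

# Assumption: each of the 4**4 = 256 possible 4-mers gets an integer id
# starting at 1; id 0 is reserved for 4-mers with ambiguous bases (e.g. N).
KMER_INDEX = {"".join(p): i + 1 for i, p in enumerate(product(ALPHABET, repeat=K))}


def kmer_tokens(seq, k=K):
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def encode(seq):
    """Map a DNA sequence to a list of 4-mer ids; unknown 4-mers map to 0."""
    return [KMER_INDEX.get(token, 0) for token in kmer_tokens(seq)]


tokens = kmer_tokens("ACGTAC")  # three overlapping 4-mers
ids = encode("ACGTAC")          # one integer id per 4-mer
```

A sequence of length n yields n - 3 tokens under this scheme. Such ids would typically be passed to an embedding layer ahead of the CNN and Bi-RNN, although the exact vectorization used by the authors may differ.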
Results: We conducted an ablation study on the validation set to select the model configuration and evaluate its effectiveness. Moreover, our model showed remarkable performance on the test set, achieving a Matthews correlation coefficient (MCC) of 0.955, an AUPRC of 0.638, and an AUROC of 0.837, which are significantly higher than those of state-of-the-art methods.
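For reference, the MCC is computed from the four cells of a binary confusion matrix. The sketch below uses the standard definition; the function name `mcc` and the convention of returning 0.0 when the denominator is zero are assumptions made for illustration, not details taken from the study.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Assumption: return 0.0 when the denominator is zero.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


perfect = mcc(tp=50, fp=0, tn=50, fn=0)     # all predictions correct
chance = mcc(tp=25, fp=25, tn=25, fn=25)    # no better than random
```

The MCC ranges from -1 (complete disagreement) through 0 (chance level) to 1 (perfect prediction), which makes it informative on imbalanced enhancer datasets.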
Conclusion: The proposed computational framework can be applied to similar problems in non-coding genomic regions, thereby providing valuable insights into the prediction of plant enhancers.
Keywords: Enhancer, Arabidopsis thaliana, DNA sequence, deep learning, attention mechanism, transcriptional regulation.