Abstract
Promoters are DNA fragments located near the transcription initiation site. According to their transcriptional activation and expression level, they can be divided into strong and weak promoter types. Identifying promoters and their strengths in DNA sequences is essential for understanding gene expression regulation, so it is crucial to further improve the predictive quality of predictors to meet real-world application requirements. Here, we constructed the latest training dataset based on the RegulonDB website; all promoters in this dataset have been experimentally validated, and their sequence similarity is less than 85%. We used one-hot encoding and nucleotide chemical property and density (NCPD) to represent DNA sequence samples. Additionally, we proposed an ensemble deep learning framework containing a multi-head attention module, a long short-term memory module, and a convolutional neural network module.
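As an illustration of the two sequence representations named above, the following is a minimal sketch rather than the authors' implementation. It assumes the commonly used NCPD definition: three binary chemical properties per nucleotide (ring structure, functional group, hydrogen bonding) plus a density term equal to the frequency of the current nucleotide in the prefix ending at that position. The function names and the choice to concatenate the two encodings per position are hypothetical.

```python
# Assumed lookup tables (not taken from the source text).
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}
# (ring structure, functional group, hydrogen bonding)
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 0, 1)}


def one_hot(seq):
    """Return one 4-dimensional indicator vector per nucleotide."""
    return [ONE_HOT[base] for base in seq.upper()]


def ncpd(seq):
    """Return (x, y, z, density) per position.

    density at position i (1-based) is the number of occurrences of the
    nucleotide at position i within seq[:i], divided by i.
    """
    counts = {}
    rows = []
    for i, base in enumerate(seq.upper(), start=1):
        counts[base] = counts.get(base, 0) + 1
        rows.append((*NCP[base], counts[base] / i))
    return rows


def encode(seq):
    """Concatenate both encodings per position (an assumed combination)."""
    return [oh + nd for oh, nd in zip(one_hot(seq), ncpd(seq))]


features = encode("ACGA")  # 4 positions, 8 features each
```

Under these assumptions, each sequence of length n becomes an n x 8 matrix that could be passed to a sequence model; how the paper actually combines the two encodings is not stated in the abstract.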
The results showed that the proposed method, iPSI(2L)-EDL, outperformed existing methods both in promoter prediction and in distinguishing strong from weak promoters. The AUC and MCC of iPSI(2L)-EDL in identifying promoters improved by 2.23% and 2.96%, respectively, compared with those of PseDNC-DL on independent testing data, while the AUC and MCC of iPSI(2L)-EDL in predicting promoter strength type increased by 3.74% and 5.86%, respectively. The ablation experiments indicate that the CNN plays a crucial role in recognizing promoters, and that the importance of different input positions and the long-range dependencies among features also help in recognizing promoters.
Furthermore, to make it easier for experimental scientists to obtain the results they need, a user-friendly web server has been established and can be accessed at http://47.94.248.117/IPSW(2L)-EDL.