Abstract
Introduction: Recent self-supervised deep language models, such as Bidirectional Encoder Representations from Transformers (BERT), have achieved state-of-the-art performance on several language tasks by contextualizing word embeddings into dynamic representations. Their protein-specific counterparts, such as ProtBERT, generate dynamic protein sequence embeddings that have improved performance on several bioinformatics tasks. Moreover, a number of different protein post-translational modifications are prominent in cellular processes such as development and differentiation. Current biological experiments can detect these modifications, but only over long durations and at significant cost.
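As a minimal sketch of how such per-residue protein embeddings are typically obtained, the snippet below uses the publicly released Rostlab/prot_bert model via the HuggingFace transformers library; the example sequence and preprocessing follow the public model card and are not necessarily the exact pipeline used in this work.

```python
# Sketch: per-residue ProtBERT embeddings with HuggingFace transformers.
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"       # illustrative protein sequence
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))  # ProtBERT expects space-separated residues; rare amino acids mapped to X
inputs = tokenizer(spaced, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Per-residue contextual embeddings, dropping the [CLS]/[SEP] special tokens
residue_embeddings = outputs.last_hidden_state[0, 1:-1]  # shape: (sequence length, 1024)
print(residue_embeddings.shape)
```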
Methods: In this paper, to characterize the accompanying biological processes more rapidly and concisely, we propose DEEPPTM to predict protein post-translational modification (PTM) sites from protein sequences more efficiently. Unlike current methods, DEEPPTM improves modification prediction performance by integrating specialized ProtBERT-based protein embeddings with attention-based vision transformers (ViT), and it reveals associations between different modification types and protein sequence content. Additionally, it can infer several different modifications across different species.
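The following is a minimal sketch of a ViT-style attention classifier over a fixed-length window of per-residue ProtBERT embeddings centered on a candidate modification site. The window length, depth, and head count are illustrative assumptions, not the exact DEEPPTM configuration.

```python
# Sketch: ViT-style classifier over a window of ProtBERT residue embeddings.
import torch
import torch.nn as nn

class ViTPTMClassifier(nn.Module):
    def __init__(self, embed_dim=1024, window_len=33, depth=4, heads=8):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))       # learnable [CLS] token
        self.pos_embed = nn.Parameter(torch.zeros(1, window_len + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, 1)                               # modified vs. unmodified site

    def forward(self, x):                      # x: (batch, window_len, embed_dim) ProtBERT embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])              # logit from the [CLS] token

# Example: score one window of 33 residue embeddings (would come from ProtBERT in practice)
window = torch.randn(1, 33, 1024)
logit = ViTPTMClassifier()(window)
```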
Results: Under 10-fold cross-validation, DEEPPTM achieved ROC AUCs of 0.988 and 0.965 for predicting succinylation sites in human and mouse, respectively. Similarly, it obtained ROC AUC scores of 0.982, 0.955, and 0.953 for inferring ubiquitination, crotonylation, and glycation sites, respectively. According to detailed computational experiments, DEEPPTM reduces the time spent on laboratory experiments while outperforming the competing methods as well as the baselines on inferring all four modification sites. In our setting, attention-based deep learning methods such as vision transformers appear better suited to learning from ProtBERT features than more traditional deep learning and machine learning techniques.
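A minimal sketch of the reported evaluation protocol, 10-fold cross-validated ROC AUC, is shown below. The logistic-regression classifier and random data are placeholders standing in for the actual model and PTM datasets.

```python
# Sketch: 10-fold cross-validated ROC AUC with scikit-learn.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X = np.random.randn(200, 1024)                 # placeholder features (e.g., pooled ProtBERT embeddings)
y = np.random.randint(0, 2, size=200)          # placeholder PTM site labels

aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))

print(f"mean ROC AUC over 10 folds: {np.mean(aucs):.3f}")
```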
Conclusion: Furthermore, the protein-specific ProtBERT model is more effective than general-purpose BERT embeddings for PTM prediction tasks. Our code and datasets are available at https://github.com/seferlab/deepptm.