Abstract
Introduction: Protein ubiquitylation is an important post-translational modification (PTM) that regulates cell function and is implicated in various diseases. Accurate prediction of ubiquitylation proteins and their PTM sites is therefore of great significance for the study of basic biological processes and the development of related drugs. Researchers have developed several large-scale computational methods to predict ubiquitylation sites, but there is still much room for improvement. Much of the existing research on ubiquitylation is cross-species, whereas life forms are diverse and prediction methods tend to be species-specific in practical application. This study focuses on plants and constructs computational methods for identifying ubiquitylation proteins and ubiquitylation sites.
Methods: In this work, we constructed two predictive models to identify plant ubiquitylation proteins and sites. In the ubiquitylation protein prediction model, to better reflect protein sequence information and obtain better prediction results, features are extracted with a KNN scoring matrix model based on functional domain Gene Ontology (GO) annotation and with word embedding models, i.e. Skip-Gram and Continuous Bag of Words (CBOW), and the light gradient boosting machine (LGBM) is selected as the prediction engine.
Results: Accuracy (ACC), Precision, Recall, F1_score and AUC are 85.12%, 80.96%, 72.80%, 76.37% and 0.9193, respectively, in 10-fold cross-validation on the independent dataset. In the ubiquitylation site prediction model, Skip-Gram, CBOW and enhanced amino acid composition (EAAC) encodings were used to extract protein sequence fragment features, and the predictions on the training and independent test data also achieved good performance.
Conclusion: The comparison results demonstrate that our models have a clear advantage in predicting ubiquitylation proteins and sites, and they may provide useful insights for studying the mechanisms and modulation of ubiquitylation pathways. The datasets and source code used in this study are available at: https://github.com/gmywqk/Ub-PS-Fuse.
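As an illustration of the EAAC encoding mentioned in the Results, the sketch below computes amino acid frequencies over sliding windows of a sequence fragment. It is a minimal sketch and not the code from the repository above: the function name `eaac`, the window size of 5, the normalization by window length, and the example fragment are all assumptions made for illustration.

```python
# Minimal EAAC sketch (assumed parameters; not the authors' implementation).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def eaac(fragment, window=5):
    """Return per-window amino acid frequencies for a sequence fragment.

    Each sliding window of length `window` contributes 20 values, one per
    standard amino acid, so the output has (len(fragment) - window + 1) * 20
    entries. Non-standard characters (e.g. padding) count toward the window
    length but contribute to no frequency.
    """
    if window < 1 or window > len(fragment):
        raise ValueError("window must be between 1 and the fragment length")
    features = []
    for start in range(len(fragment) - window + 1):
        chunk = fragment[start:start + window]
        for aa in AMINO_ACIDS:
            features.append(chunk.count(aa) / window)
    return features


# Hypothetical 7-residue fragment: three windows of size 5, 60 features.
vec = eaac("MKKLAKE", window=5)
print(len(vec))
```

In a site-prediction setting, a vector like this would typically be concatenated with the other encodings of the same fragment before being passed to a classifier.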