Citrullination Site Prediction by Incorporating Sequence Coupled Effects into PseAAC and Resolving Data Imbalance Issue

Md. Al Mehedi Hasan; Md Khaled Ben Islam; Julia Rahman; Shamim Ahmad

doi:10.2174/1574893614666191202152328

Abstract

Background: Post-translational modification is one of the biomolecular mechanisms in living organisms that introduce functional diversity into proteins and regulate cellular processes. The conversion of an arginine residue to citrulline in a protein is one such modification.

Objective: Our objective is to identify citrullinated arginine residue sites quickly and accurately.

Methods: In this study, a novel computational tool, named predCitru-Site, has been developed to predict citrullination sites. The method incorporates the sequence-coupling effect of the amino acids surrounding arginine residues and optimizes the skewed citrullination training dataset to improve prediction quality. For comparability with existing tools, the performance of predCitru-Site was measured as the average over 5 complete runs of the 10-fold cross-validation test.
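The evaluation protocol described above (5 complete runs of 10-fold cross-validation, averaged) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `stratified_folds`, `repeated_cv`, and `majority_baseline`, the toy labels, and the choice to stratify folds by class are all assumptions made here. The abstract does not say how the folds were drawn or which classifier settings were used.

```python
import random
from statistics import mean


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, keeping the class ratio in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal indices round-robin so each fold gets a near-equal share of this class.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def repeated_cv(labels, score_fold, k=10, runs=5):
    """Average a per-fold score over `runs` complete k-fold cross-validations."""
    run_means = []
    for run in range(runs):
        folds = stratified_folds(labels, k=k, seed=run)  # fresh shuffle per run
        scores = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            scores.append(score_fold(train_idx, test_idx))
        run_means.append(mean(scores))
    return mean(run_means)


# Toy, imbalanced labels (hypothetical): 20 positive sites, 80 negative sites.
labels = [1] * 20 + [0] * 80


def majority_baseline(train_idx, test_idx):
    """Placeholder scorer: predict the majority training class, return test accuracy."""
    train_pos = sum(labels[i] for i in train_idx)
    guess = 1 if train_pos * 2 > len(train_idx) else 0
    return mean(1.0 if labels[i] == guess else 0.0 for i in test_idx)


average_accuracy = repeated_cv(labels, majority_baseline, k=10, runs=5)
```

In practice `score_fold` would train a real classifier on `train_idx` and score it on `test_idx`. The majority-class placeholder also shows why accuracy alone is misleading on a skewed dataset: it scores well without ever predicting a positive site.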

Results and Conclusion: predCitru-Site achieved 97.6% sensitivity, 98.9% specificity, and an overall accuracy of 98.5%, with a Matthews correlation coefficient of 0.967 and an area under the receiver operating characteristic curve (AUC) of 0.997. On the same benchmark dataset, predCitru-Site significantly outperforms existing tools. It also shows significant improvement on independent tests in all performance metrics (around 50% higher in AUC). These results suggest that our method is promising and can be used as a complementary technique for fast exploration of citrullination at arginine residues. A user-friendly web server has also been deployed at http://research.ru.ac.bd/predCitru-Site/ for the convenience of experimental scientists.
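For readers unfamiliar with the reported metrics, the sketch below computes sensitivity, specificity, accuracy, and the Matthews correlation coefficient from a binary confusion matrix using their standard definitions. The function name `binary_metrics` and the example counts are hypothetical. They are not taken from the paper and do not reproduce its results.

```python
from math import sqrt


def binary_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # fraction of true sites recovered
    specificity = tn / (tn + fp)            # fraction of non-sites rejected
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only.
example = binary_metrics(tp=90, tn=180, fp=20, fn=10)
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, which is why it is commonly reported alongside sensitivity and specificity for imbalanced datasets.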

Keywords: Citrullination sites prediction, sequence-coupling model, general PseAAC, data imbalance issue, support vector machine, computational.




DOI https://dx.doi.org/10.2174/1574893614666191202152328	Print ISSN 1574-8936
Publisher Name Bentham Science Publisher	Online ISSN 2212-392X

Current Bioinformatics
