Identification of Plasmodium Secreted Proteins Based on MonoDiKGap
and Distance-Based Top-n-Gram Methods

Xinyi Liao; Xiaomei Gu; Dejun Peng

doi:10.2174/1574893617666220106112044

Abstract

Background: Plasmodium falciparum causes many malarial infections. Accurate classification of the proteins secreted by the malaria parasite is necessary because these proteins are essential for the development of anti-malarial drugs.

Objective: This study aimed at accurately classifying the proteins secreted by the malaria parasite.

Methods: To improve the accuracy of predicting Plasmodium secreted proteins, we established a classification model, MGAP-SGD. MonoDiKGap features (k=7) of the secreted proteins were extracted, and the optimal features were then selected by the AdaBoost method. Finally, based on the optimal feature set, the Stochastic Gradient Descent (SGD) algorithm was used to predict the secreted proteins.
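The feature-extraction step in the Methods can be illustrated with a minimal sketch. It assumes that a mono-di gapped feature counts a residue X followed, after g skipped positions (g = 1..k), by a dipeptide YZ; this definition and the function name `mono_di_kgap` are assumptions made for illustration, not the authors' implementation. Under this assumed definition, k = 7 over 20 amino acids gives 7 × 20 × 400 = 56,000 possible features, which shows why a feature-selection step would follow.

```python
from collections import Counter

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues


def mono_di_kgap(sequence, k=7):
    """Count patterns 'X, g skipped residues, YZ' for every gap g in 1..k.

    Hypothetical helper: returns a sparse Counter keyed by (X, g, YZ).
    Windows containing a non-standard residue are ignored.
    """
    seq = sequence.upper()
    counts = Counter()
    for gap in range(1, k + 1):
        # X sits at position i; the dipeptide YZ starts gap + 1 positions later.
        for i in range(len(seq) - gap - 2):
            mono = seq[i]
            di = seq[i + gap + 1:i + gap + 3]
            if mono in AMINO_ACIDS and all(r in AMINO_ACIDS for r in di):
                counts[(mono, gap, di)] += 1
    return counts


# Toy sequence (not from the paper's dataset).
features = mono_di_kgap("MKVLAAGIVK", k=2)
```

The sparse counts would normally be mapped to a fixed-length vector before being passed to a feature selector and a classifier.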

Results: The SGD classifier was validated with 10-fold cross-validation and an independent test set, achieving accuracies of 98.5859% and 97.973%, respectively.
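For readers unfamiliar with the validation scheme, the following is a minimal sketch of k-fold cross-validated accuracy. The helper names (`k_fold_indices`, `accuracy`, `cross_val_accuracy`) and the `train_and_predict` callback are hypothetical. The sketch neither shuffles nor stratifies the data, and it averages per-fold accuracy, which may differ from how the reported figures were computed.

```python
def k_fold_indices(n_samples, n_folds=10):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once.

    Assumes n_samples >= n_folds so that no test fold is empty.
    """
    indices = list(range(n_samples))
    for fold in range(n_folds):
        test_idx = indices[fold::n_folds]  # every n_folds-th sample
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def cross_val_accuracy(train_and_predict, X, y, n_folds=10):
    """Average per-fold accuracy of a user-supplied train_and_predict function.

    train_and_predict(X_train, y_train, X_test) must return predicted labels.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), n_folds):
        y_pred = train_and_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        scores.append(accuracy([y[i] for i in test_idx], y_pred))
    return sum(scores) / len(scores)
```

Any classifier, such as an SGD-trained linear model, can be plugged in through `train_and_predict`; the independent test set is held out entirely and scored once with `accuracy`.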

Conclusion: These results confirm the effectiveness and robustness of the MGAP-SGD model, which can meet the requirements for predicting the secreted proteins of Plasmodium.

Keywords: Plasmodium, Top-n-gram, MonoDiKGap, dimensionality reduction, cross-validation, features.

DOI: https://dx.doi.org/10.2174/1574893617666220106112044
Print ISSN: 1574-8936
Online ISSN: 2212-392X
Publisher Name: Bentham Science Publisher

Current Bioinformatics
