BDselect: A Package for k-mer Selection Based on the Binomial Distribution

Fu-Ying Dao; Hao Lv; Zhao-Yue Zhang; Hao Lin

doi:10.2174/1574893616666211007102747

Abstract

Background: Dimension disaster (the curse of dimensionality) often accompanies feature extraction. The extracted features may contain redundant information, which strains computing resources and leads to overfitting.

Objective: Feature selection is an important strategy for overcoming the problems caused by dimension disaster. In most machine learning tasks, the features determine the upper limit of model performance. New feature selection methods are therefore needed to remove redundant features.

Methods: In this paper, we introduce a new technique for optimizing sequence features based on the binomial distribution (BD). First, the principle of the BD algorithm is described in detail. Then, the proposed algorithm is compared with other commonly used feature selection methods on three different types of datasets, using a Random Forest classifier with identical parameters.
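The abstract does not give the scoring formula, so the following is only a minimal sketch under a stated assumption: that each k-mer is scored by the binomial tail probability of seeing it at least as often as observed in a class, given that class's share of all k-mer occurrences, and ranked by the confidence level CL = 1 - P. The function names (`count_kmers`, `binomial_tail`, `bd_rank`) are hypothetical and are not taken from the BDselect package.

```python
from math import comb
from collections import Counter


def count_kmers(seq, k):
    """Count overlapping k-mers in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def binomial_tail(n_total, n_obs, q):
    """Return P(X >= n_obs) for X ~ Binomial(n_total, q)."""
    return sum(
        comb(n_total, m) * q ** m * (1 - q) ** (n_total - m)
        for m in range(n_obs, n_total + 1)
    )


def bd_rank(seqs_by_class, k):
    """Rank k-mers by CL = 1 - P, taking the largest CL over all classes.

    Assumed formulation (not confirmed by the abstract): the prior q_j is
    the fraction of all k-mer occurrences that fall in class j.
    """
    # Pool k-mer counts within each class.
    class_counts = {}
    for label, seqs in seqs_by_class.items():
        pooled = Counter()
        for s in seqs:
            pooled.update(count_kmers(s, k))
        class_counts[label] = pooled

    grand_total = sum(sum(c.values()) for c in class_counts.values())
    prior = {lab: sum(c.values()) / grand_total
             for lab, c in class_counts.items()}

    scores = {}
    for kmer in set().union(*class_counts.values()):
        # Total occurrences of this k-mer across all classes.
        n_total = sum(c[kmer] for c in class_counts.values())
        scores[kmer] = max(
            1.0 - binomial_tail(n_total, c[kmer], prior[lab])
            for lab, c in class_counts.items()
        )
    # Highest confidence first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    toy = {"pos": ["AAAAAA", "AAAAAT"], "neg": ["CGCGCG", "GCGCGC"]}
    for kmer, cl in bd_rank(toy, k=2):
        print(kmer, round(cl, 4))
```

On the toy input, a k-mer that occurs many times in only one class receives a confidence level close to 1, while a k-mer seen once scores much lower. In a full workflow, the top-ranked k-mers would then be passed to a classifier such as Random Forest.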

Results: The results indicate that BD yields promising improvements in feature selection and classification accuracy.

Conclusion: We provide the source code and an executable program package (http://lingroup.cn/server/BDselect/), with which users can easily apply our algorithm in their own research.

Keywords: Dimension disasters, feature selection, binomial distribution, machine learning, random forest classifier, datasets.

Journal: Current Bioinformatics
Publisher: Bentham Science Publishers
Print ISSN: 1574-8936
Online ISSN: 2212-392X
