
Current Bioinformatics

ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

BDselect: A Package for k-mer Selection Based on the Binomial Distribution

Author(s): Fu-Ying Dao, Hao Lv, Zhao-Yue Zhang and Hao Lin*

Volume 17, Issue 3, 2022

Published on: 26 January, 2022

Page: [238 - 244] Pages: 7

DOI: 10.2174/1574893616666211007102747

Abstract

Background: Feature extraction often suffers from dimension disaster (the curse of dimensionality). The extracted features may contain redundant information, which strains computing resources and leads to overfitting.

Objective: Feature selection is an important strategy for overcoming the problems caused by dimension disaster. In most machine learning tasks, the features determine the upper limit of model performance. Therefore, more feature selection methods need to be developed to remove redundant features.

Methods: In this paper, we introduce a new technique for optimizing sequence features based on the Binomial Distribution (BD). First, the principle of the binomial distribution algorithm is introduced in detail. Then, the proposed algorithm is compared with other commonly used feature selection methods on three different types of datasets, using a Random Forest classifier with the same parameters.

Results: The results confirm that BD yields a promising improvement in feature selection and classification accuracy.

Conclusion: We provide the source code and an executable program package (http://lingroup.cn/server/BDselect/) so that users can easily apply our algorithm in their research.

Keywords: Dimension disasters, feature selection, binomial distribution, machine learning, random forest classifier, datasets.
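
The binomial-distribution scoring mentioned in the Methods can be illustrated with a minimal sketch. The abstract does not give the formula, so the form used here is an assumption: each k-mer is scored by one minus the binomial tail probability of seeing at least its observed count in a class, given that class's share of all k-mer occurrences, and k-mers are ranked by their best score across classes. The function names and toy counts are hypothetical and are not taken from the BDselect package.

```python
from math import comb


def binomial_tail(n_total, n_obs, q):
    """Return P(X >= n_obs) for X ~ Binomial(n_total, q)."""
    return sum(
        comb(n_total, m) * q**m * (1 - q) ** (n_total - m)
        for m in range(n_obs, n_total + 1)
    )


def bd_scores(counts):
    """Score each k-mer from a class-by-k-mer count matrix.

    counts[j][i] is the number of occurrences of k-mer i in class j.
    The score is the maximum over classes of 1 - P(X >= n_ij), an
    assumed form of the binomial-distribution confidence level.
    """
    n_classes = len(counts)
    n_kmers = len(counts[0])
    class_totals = [sum(row) for row in counts]
    grand_total = sum(class_totals)
    # q[j]: share of all k-mer occurrences that fall in class j
    q = [total / grand_total for total in class_totals]

    scores = []
    for i in range(n_kmers):
        n_i = sum(counts[j][i] for j in range(n_classes))
        confidence = [
            # clamp guards against tiny negative values from rounding
            max(0.0, 1.0 - binomial_tail(n_i, counts[j][i], q[j]))
            for j in range(n_classes)
        ]
        scores.append(max(confidence))
    return scores


def rank_kmers(kmers, counts):
    """Return (k-mer, score) pairs sorted from most to least class-specific."""
    return sorted(zip(kmers, bd_scores(counts)), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    toy_kmers = ["AA", "AC", "GT"]  # hypothetical 2-mers
    toy_counts = [
        [30, 10, 5],   # occurrences in class 0
        [10, 10, 25],  # occurrences in class 1
    ]
    for kmer, score in rank_kmers(toy_kmers, toy_counts):
        print(f"{kmer}\t{score:.4f}")
```

In this toy example, a k-mer that occurs equally often in both classes receives a low score, while one concentrated in a single class scores close to 1. A top-ranked subset could then be passed to a classifier such as a Random Forest.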


© 2024 Bentham Science Publishers