Abstract
Background: The curse of dimensionality (also called dimension disaster) often arises in feature extraction. The extracted features may contain a large amount of redundant information, which strains computing capacity and leads to overfitting.
Objective: Feature selection is an important strategy for overcoming the problems caused by dimension disaster. In most machine learning tasks, the features determine the upper limit of model performance. Therefore, new feature selection methods are needed to remove redundant features.
Methods: In this paper, we introduce a new technique, based on the binomial distribution (BD), for optimizing sequence features. First, the principle of the BD algorithm is described in detail. Then, the proposed algorithm is compared with other commonly used feature selection methods on three different types of datasets, using a Random Forest classifier with the same parameters in every case.
Results: The results indicate that BD yields a promising improvement in both feature selection and classification accuracy.
Conclusion: We provide the source code and an executable program package (http://lingroup.cn/server/BDselect/), with which users can easily apply our algorithm in their own research.
Keywords: Dimension disasters, feature selection, binomial distribution, machine learning, random forest classifier, datasets.