Current Bioinformatics

ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

A Combined Feature Screening Approach of Random Forest and Filter-based Methods for Ultra-high Dimensional Data

Author(s): Lifeng Zhou and Hong Wang*

Volume 17, Issue 4, 2022

Published on: 14 April, 2022

Page: [344 - 357] Pages: 14

DOI: 10.2174/1574893617666220221120618

Abstract

Background: Over the past decade, various feature (variable) screening approaches have been proposed to mitigate the impact of ultra-high dimensionality in classification and regression problems, including filter-based methods such as sure independence screening and wrapper-based methods such as random forest. However, the former rely heavily on strong modelling assumptions, while the latter require an adequate sample size to make the data speak for themselves. These requirements can seldom be met in biochemical studies, in cases where we only have access to ultra-high dimensional data with a complex structure and a small number of observations.

Objective: This research investigates the possibility of combining filter-based screening methods with random forest based screening methods in the regression context.

Methods: For regression problems, we combined four state-of-the-art filter approaches from the statistical community, namely sure independence screening (SIS), robust rank correlation based screening (RRCS), high dimensional ordinary least squares projection (HOLP), and a model-free sure independence screening procedure based on the distance correlation (DCSIS), with a random forest based Boruta screening method from the machine learning community.
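
As a rough illustration of two of the filter approaches named above, the following Python sketch ranks features by absolute marginal Pearson correlation (the idea behind SIS) and by sample distance correlation (the idea behind DCSIS). It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `sis_scores`, `dcsis_scores`, `distance_correlation` and `top_k` are hypothetical, the distance correlation uses the plain biased (V-statistic) estimator, and the random forest based Boruta stage and the rule for combining the two screens are not specified in this abstract and are therefore not shown.

```python
import numpy as np


def sis_scores(X, y):
    """SIS-style utility: absolute marginal Pearson correlation of each column with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom[denom == 0] = np.inf  # constant columns (or constant y) get score 0
    return np.abs(Xc.T @ yc) / denom


def _centered_dist(v):
    """Double-centered matrix of pairwise absolute differences for a 1-D sample."""
    d = np.abs(v[:, None] - v[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Biased (V-statistic) sample distance correlation between two 1-D samples."""
    A = _centered_dist(x)
    B = _centered_dist(y)
    dcov2 = (A * B).mean()          # squared sample distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom == 0:                  # at least one sample is constant
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def dcsis_scores(X, y):
    """DCSIS-style utility: distance correlation of each column with y."""
    return np.array([distance_correlation(X[:, j], y) for j in range(X.shape[1])])


def top_k(scores, k):
    """Indices of the k largest scores, best first."""
    return np.argsort(scores)[::-1][:k]
```

With a design matrix `X` of shape `(n, p)` and a response `y` of length `n`, a call such as `top_k(dcsis_scores(X, y), k)` would return the indices of the `k` retained features; the choice of `k` is left to the user here. Ranking by distance correlation or by its square gives the same ordering.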

Results: Among all the combined methods, RF-DCSIS performs better than the others in terms of screening accuracy and prediction capability on both the simulated scenarios and the real benchmark datasets.

Conclusion: Through empirical study on both extensive simulations and real data, we have shown that filter-based screening and random forest based screening each have their pros and cons, while a combination of the two may lead to better feature screening results and prediction capability.

Keywords: Feature screening, filter-based method, ultra-high dimensional data, variable selection, random forest, RF-DCSIS.
