Current Bioinformatics

ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

Construction of Network Biomarkers Using Inter-Feature Correlation Coefficients (FeCO3) and their Application in Detecting High-Order Breast Cancer Biomarkers

Author(s): Shenggeng Lin, Yuqi Lin, Kexin Wu, Yueying Wang, Zixuan Feng, Meiyu Duan, Shuai Liu, Yusi Fan, Lan Huang and Fengfeng Zhou*

Volume 17, Issue 4, 2022

Published on: 11 April, 2022

Page: [310 - 326] Pages: 17

DOI: 10.2174/1574893617666220124123303

Abstract

Aims: This study aims to formulate inter-feature correlations as engineered features.

Background: Modern biotechnologies tend to generate a huge number of characteristics per sample, while an OMIC dataset usually has only a few dozen or a few hundred samples due to the high cost of generating OMIC data. Therefore, many bio-OMIC studies have assumed inter-feature independence and selected features with a high phenotype association.

Objective: Many features are closely associated with each other due to their physical or functional interactions, and these associations may be utilized as a new view of the features.

Methods: This study proposed a feature engineering algorithm based on correlation coefficients (FeCO3) that utilizes the correlations between a given sample and a few reference samples. The proposed FeCO3 network features were comprehensively evaluated using 24 bio-OMIC datasets.
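
The idea of turning sample-to-reference correlations into features can be illustrated with a minimal sketch. The abstract does not specify how the reference samples are chosen or which features enter each correlation, so the code below is only one possible reading, not the authors' FeCO3 implementation: it assumes each sample's whole feature profile is correlated (Pearson) with each reference profile, and the function name `correlation_features` is hypothetical.

```python
import numpy as np

def correlation_features(samples, references):
    """Pearson correlation between every sample and every reference.

    Hypothetical helper for illustration only (not the FeCO3 code).
    samples:    shape (n_samples, n_features)
    references: shape (n_refs, n_features)
    returns:    shape (n_samples, n_refs)
    """
    samples = np.asarray(samples, dtype=float)
    references = np.asarray(references, dtype=float)
    # Center each profile across its own features.
    s = samples - samples.mean(axis=1, keepdims=True)
    r = references - references.mean(axis=1, keepdims=True)
    # Scale each centered profile to unit length
    # (a constant profile has zero length and would yield NaN).
    s /= np.linalg.norm(s, axis=1, keepdims=True)
    r /= np.linalg.norm(r, axis=1, keepdims=True)
    # The dot product of unit-length centered vectors is Pearson's r.
    return s @ r.T

# Toy data: 5 samples with 20 features; the first 3 samples are
# (arbitrarily, as an assumption) used as the reference samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))
engineered = correlation_features(X, X[:3])  # shape (5, 3)
```

Each row of `engineered` could then be passed to a standard feature selection and classification pipeline in place of, or alongside, the original features. Swapping Pearson for Spearman would amount to ranking each profile before calling the function.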

Results: The experimental data suggested that the newly calculated FeCO3 network features tended to achieve better classification performance than the original features when the same popular feature selection and classification algorithms were used. The FeCO3 network features were also consistently supported by the literature. FeCO3 was utilized to investigate the high-order engineered biomarkers of breast cancer and detected the PBX2 gene (Pre-B-Cell Leukemia Transcription Factor 2) as one of the candidate breast cancer biomarkers. Although the two methylated residues cg14851325 (P-value = 8.06e-2) and cg16602460 (P-value = 1.19e-1) within PBX2 did not have a statistically significant association with breast cancer, the high-order inter-feature correlations showed a significant association with breast cancer.

Conclusion: The proposed FeCO3 network features capture high-order inter-feature correlations as novel features and may facilitate the investigation of complex diseases from this new perspective. The source code is available on FigShare (DOI: 10.6084/m9.figshare.13550051) or at the website http://www.healthinformaticslab.org/supp/.

Keywords: Pearson correlation coefficient, Spearman correlation coefficient, feature selection, feature construction, feature engineering, FeCO3.
