Abstract
Aims: This study aims to formulate inter-feature correlations as engineered features.
Background: Modern biotechnologies tend to generate a huge number of characteristics per sample, while an OMIC dataset usually has only a few dozen or a few hundred samples because of the high cost of generating OMIC data. Therefore, many bio-OMIC studies have assumed inter-feature independence and selected individual features with a high phenotype association.
Objective: Many features are closely associated with each other due to their physical or functional interactions, and these associations may be utilized as a new view of the features.
Methods: This study proposed a feature engineering algorithm based on the correlation coefficients (FeCO3) by utilizing the correlations between a given sample and a few reference samples. A comprehensive evaluation was carried out for the proposed FeCO3 network features using 24 bio-OMIC datasets.
Results: The experimental data suggested that the newly calculated FeCO3 network features tended to achieve better classification performance than the original features when the same popular feature selection and classification algorithms were used. The FeCO3 network features were also consistently supported by the literature. FeCO3 was utilized to investigate the high-order engineered biomarkers of breast cancer and detected the PBX2 gene (Pre-B-Cell Leukemia Transcription Factor 2) as one of the candidate breast cancer biomarkers. Although the two methylated residues cg14851325 (P-value = 8.06e-2) and cg16602460 (P-value = 1.19e-1) within PBX2 did not have a statistically significant association with breast cancer, the high-order inter-feature correlations showed a significant association with breast cancer.
Conclusion: The proposed FeCO3 network features calculate the high-order inter-feature correlations as novel features and may facilitate the investigation of complex diseases from this new perspective. The source code is available on FigShare (DOI: 10.6084/m9.figshare.13550051) or at the website http://www.healthinformaticslab.org/supp/.
Keywords: Pearson correlation coefficient, Spearman correlation coefficient, feature selection, feature construction, feature engineering, FeCO3.
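The idea summarized in the Methods, describing a sample by its correlations with a few reference samples, can be illustrated with a minimal sketch. This is not the released FeCO3 code: the function name `correlation_features`, the choice of the first two samples as references, and the use of the Pearson coefficient over all features are assumptions made only for illustration, and the published algorithm may select references and feature subsets differently.

```python
import numpy as np

def correlation_features(X, refs):
    """Hypothetical helper: Pearson correlation of each sample with each reference.

    X    : (n_samples, n_features) array of original feature values
    refs : (n_refs, n_features) array of reference samples
    Returns an (n_samples, n_refs) array of engineered features.
    Rows with zero variance would cause a division by zero and are not handled here.
    """
    X = np.asarray(X, dtype=float)
    refs = np.asarray(refs, dtype=float)
    # Center every sample vector across its features
    Xc = X - X.mean(axis=1, keepdims=True)
    Rc = refs - refs.mean(axis=1, keepdims=True)
    # Norms of the centered vectors
    Xn = np.linalg.norm(Xc, axis=1, keepdims=True)
    Rn = np.linalg.norm(Rc, axis=1, keepdims=True)
    # Cosine of the centered vectors equals the Pearson coefficient
    return (Xc @ Rc.T) / (Xn * Rn.T)

# Synthetic data for illustration only (not from the 24 datasets in the study)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))   # 6 samples, 50 features
refs = X[:2]                   # assumption: first two samples act as references
F = correlation_features(X, refs)
print(F.shape)
```

Each column of `F` is one engineered feature, and these columns could then be passed to the same feature selection and classification algorithms as the original features. Replacing the raw values with their ranks before the computation would give a Spearman-based variant.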