A New Approach of Outlier-robust Missing Value Imputation for Metabolomics Data Analysis

Nishith Kumar; Md. Aminul Hoque; Md. Shahjaman; S.M. Shahinul Islam; Md. Nurul Haque Mollah

doi:10.2174/1574893612666171121154655

Abstract

Background: Metabolomics data generation and quantification differ from those of other types of molecular “omics” data in bioinformatics. Mass spectrometry (MS) based metabolomics data, such as gas chromatography mass spectrometry (GC-MS) and liquid chromatography mass spectrometry (LC-MS) data, frequently contain missing values that complicate some quantitative analyses. Metabolomics datasets typically contain 10% to 20% missing values, which arise from analytical, computational and biological sources. Imputing these missing values is therefore an important step for further metabolomics data analysis.

Objective: This paper introduces a new algorithm for missing value imputation in the presence of outliers for metabolomics data analysis.

Method: Currently, the best-known missing value imputation techniques for metabolomics data are k-nearest neighbours (kNN), random forest (RF) and zero imputation. However, these techniques are sensitive to outliers. In this paper, we propose an outlier-robust missing value imputation technique that imputes missing values in metabolomics data by minimizing a two-way empirical mean absolute error (MAE) loss function.
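The abstract does not spell out the algorithm, so the following is only an illustrative sketch of how an MAE (L1) loss can be minimized in both the row and column directions to impute missing entries. It assumes a rank-one model x_ij ≈ a_i * b_j fitted by alternating weighted medians, and it assumes every row and column has at least one observed value. The function names (`weighted_median`, `l1_scale`, `impute_rank1_mae`) and the rank-one choice are hypothetical and are not taken from the paper.

```python
import numpy as np

def weighted_median(values, weights):
    """Return a minimizer m of sum(weights * |values - m|)."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    # First position where the cumulative weight reaches half the total.
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

def l1_scale(x, z):
    """Minimize sum |x - c * z| over the scalar c (a weighted median)."""
    keep = z != 0
    if not keep.any():
        return 0.0
    return weighted_median(x[keep] / z[keep], np.abs(z[keep]))

def impute_rank1_mae(X, n_iter=50):
    """Fill NaN entries of X from a rank-one fit under absolute-error loss.

    Assumption: a rank-one model is adequate; this is an illustration,
    not the authors' algorithm.
    """
    X = np.asarray(X, dtype=float)
    obs = ~np.isnan(X)                 # mask of observed entries
    n, p = X.shape
    a = np.ones(n)
    b = np.nanmedian(X, axis=0)        # start from column medians
    for _ in range(n_iter):
        for i in range(n):             # row step: update a_i with b fixed
            a[i] = l1_scale(X[i, obs[i]], b[obs[i]])
        for j in range(p):             # column step: update b_j with a fixed
            b[j] = l1_scale(X[obs[:, j], j], a[obs[:, j]])
    fitted = np.outer(a, b)
    # Observed values are kept; only missing entries take fitted values.
    return np.where(obs, X, fitted)
```

Because each update is a weighted median rather than a weighted mean, a single extreme value in a row or column has limited influence on the fitted factors, which is the general motivation for using an absolute-error loss when outliers are present.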

Results: We investigated the performance of the proposed missing value imputation technique in comparison with other traditional imputation techniques, using both simulated and real data, in the absence and presence of outliers.

Conclusion: The results of both simulated and real data analyses show that the proposed outlier-robust missing value imputation technique performs better than the traditional missing value imputation methods in both the absence and presence of outliers.

Keywords: Metabolomics, missing data, missing value imputation, singular value decomposition (SVD), receiver operating characteristic (ROC) curve, support vector machine.




DOI: https://dx.doi.org/10.2174/1574893612666171121154655
Print ISSN: 1574-8936
Online ISSN: 2212-392X
Publisher: Bentham Science Publisher

Current Bioinformatics
