Abstract
Contemporary metabolomics experiments generate complex, high-dimensional data. Consequently, there have been concurrent efforts to develop methodological standards and analytical workflows that streamline the derivation of meaningful biochemical and clinical inferences from raw data generated on analytical platforms such as mass spectrometry. While such considerations have been frequently addressed in untargeted metabolomics (i.e., the broad survey of all distinguishable metabolites within a sample of interest), this methodological scrutiny has seldom been applied to data generated using commercial, targeted metabolomics kits. We suggest that this may, in part, account for both past and more recent failures to fully replicate previously specified biomarker panels. Herein, we identify common impediments to the analysis of raw, targeted metabolomic abundance data from a commercial kit and review methods to remedy them. In doing so, we propose an analytical pipeline suitable for pre-processing data for downstream biomarker discovery. We also discuss operational and statistical considerations for integrating targeted data sets across experimental sites and analytical batches, as well as best practices for developing predictive models relating pre-processed metabolomic data to associated phenotypic information.
Keywords: Targeted metabolomics, Biomarker, Data Pre-processing, Machine learning, Exploratory data analysis, LOD, Small biological datasets.
Graphical Abstract