Abstract
Mass spectrometry-based proteomics allows us to analyze complex mixtures of proteins from various biological samples in a high-throughput manner, in order to identify important proteomic patterns and, ideally, novel disease biomarkers. However, like most omics data, mass spectrometry proteomics data are complex, noisy and incomplete. Additionally, the data usually comprise relatively few samples and a very large number of predictor variables, i.e., m/z peaks. These characteristics pose a significant challenge for most computational analysis methods, and various alternatives have been proposed in the recent literature.
A typical mass spectrometry proteomics data analysis workflow consists of two major steps: preprocessing and higher-level analysis. In recent years, a wide range of algorithms has been proposed for both steps, varying from classical approaches to second-generation algorithms. Many of the proposed algorithms have been reported to produce encouraging results. However, no common strategy has emerged as the method of choice, and different algorithms produce different results on each dataset, making the evaluation of the algorithms practically impossible.
This work provides a critical review of recent approaches to both preprocessing and higher-level analysis of proteomics data. The strengths and limitations of each method are presented, and emphasis is placed on describing the most common and serious mistakes recorded in published differential proteomics studies. Moreover, based on our experience, the review provides guidance for choosing and correctly applying the appropriate algorithms, as well as hints for the design of novel algorithms that will handle the specific characteristics and constraints of differential proteomics data more effectively.
Keywords: Biomarker discovery, differential proteomics, disease diagnostics, MALDI/SELDI, preprocessing, mass spectrometry data analysis, proteomic profile analysis, mass spectrometry, Ant Colony Optimization, Artificial Neural Network, Correlation-Based Feature Selection, Continuous Wavelet Transform, Electrospray Ionization, Fast Fourier Transformation, Genetic Algorithm, Liquid Chromatography