Abstract
Like most high-throughput data, mass spec proteomics data are complex, noisy and incomplete. Moreover, in studies of differential protein expression the data usually comprise relatively few samples and a very large number of predictor variables, i.e., m/z peaks. These characteristics pose a significant challenge for most analysis methods. In addition, the pre-processing of the data remains an active research area with a great impact on the subsequent analysis steps. A wide range of algorithms has been proposed for both the pre-processing and the higher-level analysis of proteomics data. They range from classical approaches to second-generation algorithms, which aim to tackle some of the limitations of earlier methods. Many of the proposed algorithms have been reported to produce encouraging results. However, no single algorithm has emerged as the method of choice. This work provides a critical review of recent approaches for pre-processing and higher-level analysis of proteomics data and evaluates their strengths and limitations. Emphasis is placed on describing the most common and serious mistakes found in published differential proteomics studies. Moreover, drawing on our experience, the review provides guidance for choosing and correctly applying the appropriate algorithms. Finally, hints are discussed for the design of novel algorithms that will more effectively handle the specific characteristics and constraints of differential proteomics data.
Keywords: Differential proteomics, mass-spec data analysis, proteomic profile analysis, biomarker discovery, disease diagnostics