Abstract
Background: The treatment and interpretation of real data remains a major challenge. Data sets are often high-dimensional, contain a small number of observations and are noisy. Furthermore, in recent years many approaches have been suggested for integrating continuous with categorical/ordinal data, in order to capture the information that is lost when each data type is studied independently.
Objective: The aim of this paper is to develop a statistical tool for the detection of outliers that is suited to any kind of feature and to high-dimensional data.
Method: The data form an n × p matrix (n ≪ p) in which the rows correspond to observations and the columns to features of any kind. The new procedure is based on the distances between all the observations and produces a ranking by assigning each observation a value that reflects its degree of outlyingness. It was evaluated by simulation and on real data from clinical and genetic studies.
Results: The simulation studies showed that the procedure correctly identified the outliers, was robust to the masking effect and was useful for detecting noise. With simulated two-sample microarray data sets, it correctly detected outliers, especially when many genes showed increased expression in only a small number of samples. The method was also applied to data sets on adult lymphoid malignancies, human liver cancer and autism multiplex families, where it gave useful results.
Conclusion: The real-data and simulation studies show the efficiency of the procedure, which offers a useful tool for applications in which the detection of outliers or noise is relevant.
Keywords: Biomedical data, data depth, gene expression, microarray, noise, outlier, robust estimation.
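The distance-based ranking summarized under Method can be illustrated with a minimal sketch. It is not the procedure developed in the paper: it assumes a Gower-type distance for mixed continuous and categorical features and uses the median distance to all other observations as the outlyingness value. The function names `gower_distances` and `outlyingness_ranking`, and the toy data, are hypothetical.

```python
import numpy as np


def gower_distances(numeric, categorical):
    """Pairwise Gower-type distances for mixed data (illustrative assumption)."""
    numeric = np.asarray(numeric, dtype=float)
    categorical = np.asarray(categorical)
    # Range-scale each numeric column so every feature contributes in [0, 1].
    span = numeric.max(axis=0) - numeric.min(axis=0)
    span[span == 0] = 1.0  # constant columns contribute zero distance
    num_part = np.abs(numeric[:, None, :] - numeric[None, :, :]) / span
    # Categorical features contribute 1 on a mismatch and 0 on a match.
    cat_part = (categorical[:, None, :] != categorical[None, :, :]).astype(float)
    n_features = numeric.shape[1] + categorical.shape[1]
    return (num_part.sum(axis=2) + cat_part.sum(axis=2)) / n_features


def outlyingness_ranking(dist):
    """Score each observation by its median distance to the others."""
    n = dist.shape[0]
    # Drop the zero diagonal so self-distances do not bias the median.
    off_diag = dist[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    scores = np.median(off_diag, axis=1)
    order = np.argsort(-scores)  # most outlying observation first
    return scores, order


# Toy data with n << p: 10 observations, 200 numeric and 5 categorical features.
rng = np.random.default_rng(0)
numeric = rng.normal(size=(10, 200))
categorical = rng.integers(0, 3, size=(10, 5))
numeric[0] += 6.0    # shift observation 0 away from the rest
categorical[0] = 9   # give it a category level no other observation has

dist = gower_distances(numeric, categorical)
scores, order = outlyingness_ranking(dist)
print("most outlying observation:", order[0])
```

The median is used here because it is less sensitive than the mean to a few other extreme observations, which is one simple way to limit masking. The intermediate array has size n × n × p, which is manageable only because n is small.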