Abstract
Background: Batch effects are commonly introduced into gene expression data and can dramatically reduce the accuracy of statistical inference in genomic analysis, since samples from different batches are not directly comparable.
Objective: To accurately measure biological variability and obtain correct statistical inference, we aimed to correct/remove batch effects so that samples from different batches can be merged into a comparable dataset for high-throughput genomic analysis.
Methods: The existing location/scale (L/S) model uses empirical Bayes methods to estimate constant multiplicative/additive adjustment values for each gene. In contrast, we took a dimensionality reduction approach. We proposed an effective scaling method that scales each gene by multiplying it by a constant value, formulated as an optimization problem based on spectral clustering, so that data samples from different batches can be merged into a comparable, batch-effect-corrected dataset. Furthermore, we proposed an approximate solution to the optimization problem for the scaling adjustment values.
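To make the gene-wise scaling idea concrete, the sketch below is a minimal Python illustration, not the paper's algorithm: the per-gene factors are computed by a simple mean-ratio heuristic standing in for the spectral-clustering-based optimization described above, and scikit-learn's SpectralEmbedding is used only to inspect how well the two batches mix after scaling. The batch sizes, bias model, and variable names are illustrative assumptions.

```python
# Illustrative sketch only: gene-wise multiplicative scaling to merge two
# batches, with a spectral embedding used to inspect batch mixing.
# The mean-ratio factors below are a heuristic stand-in for the
# spectral-clustering-based optimization proposed in the paper.
import numpy as np
from sklearn.manifold import SpectralEmbedding

rng = np.random.default_rng(0)

# Two synthetic batches sharing the same biology; batch 2 carries a
# per-gene multiplicative batch bias.
n_genes, n1, n2 = 50, 30, 30
base = rng.lognormal(mean=2.0, sigma=0.5, size=(n_genes, n1 + n2))
bias = rng.uniform(0.5, 2.0, size=(n_genes, 1))  # batch-specific gene scaling
X1 = base[:, :n1]
X2 = base[:, n1:] * bias

# Scaling adjustment: one multiplicative constant per gene that matches
# batch 2's per-gene mean to batch 1's.
alpha = X1.mean(axis=1, keepdims=True) / X2.mean(axis=1, keepdims=True)
X2_corrected = X2 * alpha

# Spectral embedding of the merged data (samples x genes) before and after
# correction; after correction the two batches should overlap in the
# embedding instead of separating into two batch-driven clusters.
merged_before = np.hstack([X1, X2]).T
merged_after = np.hstack([X1, X2_corrected]).T
emb_before = SpectralEmbedding(n_components=2, random_state=0).fit_transform(merged_before)
emb_after = SpectralEmbedding(n_components=2, random_state=0).fit_transform(merged_after)
```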
Results: We evaluated the proposed method on both artificial and real gene expression datasets, comparing it with well-established batch effect correction methods. Numerical experiments show that the proposed method projects data samples from different batches so that they resemble each other, and that it outperforms the other methods on both microarray and single-cell RNA-seq datasets.
Conclusion: The gene-wise scaling adjustment combined with dimensionality reduction improved accuracy and removed batch effects, making the proposed method more robust to interfering genes.
Keywords: Batch effects, spectral clustering, scaling adjustment, dimensionality reduction, microarray, single-cell RNA-seq.