Abstract
Background: Because mRNA expression is easier to quantify than protein abundance, many studies have used mRNA levels as a proxy for protein levels. However, the expression of a gene's mRNA and the abundance of its protein products are not known to correlate strongly, because protein levels are governed by complex regulatory mechanisms ranging from translation to post-translational modification.
Methods: In this study, we developed models that predict protein levels from mRNA expression levels, using transcriptome data and reverse phase protein array (RPPA)-based protein levels in pan-cancer cell lines. To predict the abundance of a protein, we used the RNA expression of the corresponding gene together with the RNA expression levels of a specific set of other genes. By applying support vector regression, we identified a 47-gene expression panel that improves prediction performance, along with optimal subsets of the panel specific to each protein species.
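The idea of adding panel genes to a protein's own transcript can be illustrated with a minimal sketch on synthetic data. Everything specific here is an assumption made for illustration and is not taken from the study: the linear kernel, the hyperparameters, the 5-fold cross-validation, the five-gene toy panel, and the simulated relationship between RNA and protein. The sketch uses scikit-learn's `SVR` and compares a model trained on the corresponding gene alone with one trained on that gene plus the panel.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 120 cell lines, a toy 5-gene panel.
rng = np.random.default_rng(0)
n_lines, n_panel = 120, 5
own_rna = rng.normal(size=n_lines)               # RNA of the corresponding gene
panel_rna = rng.normal(size=(n_lines, n_panel))  # RNA of hypothetical panel genes

# Simulated protein level: depends on its own RNA and on three panel genes.
panel_weights = np.array([0.8, -0.6, 0.4, 0.0, 0.0])
protein = (0.5 * own_rna + panel_rna @ panel_weights
           + rng.normal(scale=0.3, size=n_lines))


def cv_pearson(features, target):
    """Cross-validated Pearson r between SVR predictions and measured values."""
    # Kernel and hyperparameters are illustrative assumptions.
    model = make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0, epsilon=0.1))
    predicted = cross_val_predict(model, features, target, cv=5)
    return float(np.corrcoef(predicted, target)[0, 1])


r_own = cv_pearson(own_rna.reshape(-1, 1), protein)
r_panel = cv_pearson(np.column_stack([own_rna, panel_rna]), protein)
print(f"own gene only: r = {r_own:.2f}; own gene + panel: r = {r_panel:.2f}")
```

In this toy setting the panel model scores higher by construction, because the simulated protein level depends on the panel genes; the sketch says nothing about the effect size obtained with real RPPA data.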
Results and Conclusion: Our final prediction models doubled the number of proteins whose expression could be predicted (r > 0.7). Although the models have some limitations owing to the weaknesses of RPPA, we expect that these prediction models and the panel can be widely used to infer protein abundances.
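As a rough illustration of the "predictable protein" criterion, the sketch below counts proteins whose predicted and measured levels correlate above a threshold. It assumes that r denotes the Pearson correlation between predicted and measured levels across cell lines; the protein names and values are hypothetical toy data, not results from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def count_predictable(measured, predicted, threshold=0.7):
    """Count proteins whose predictions correlate with measurements above threshold."""
    return sum(
        1 for name in measured
        if pearson_r(measured[name], predicted[name]) > threshold
    )


# Hypothetical toy data: one value per cell line for each protein.
measured = {"PROT_A": [1, 2, 3, 4, 5], "PROT_B": [1, 2, 3, 4, 5]}
predicted = {"PROT_A": [1.1, 1.9, 3.2, 3.8, 5.1], "PROT_B": [3, 1, 4, 2, 3]}

n_predictable = count_predictable(measured, predicted)
print(n_predictable)
```

Here only PROT_A passes the threshold, since the PROT_B predictions barely track the measured values.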
Keywords: Protein abundance, gene expression, prediction model, support vector regression, reverse phase protein array, cancer cell line.
Graphical Abstract