Abstract
Background: Metabolomics is a relatively new and prominent branch of bioinformatics. Metabolite expression levels control the phenotypic characteristics of an organism. Breast cancer is the leading type of cancer in women worldwide, accounting for 25% of all cases. In 2012, breast cancer accounted for 1.68 million cases and 522,000 deaths. Therefore, for drug discovery as well as for early prediction of disease status, identifying the metabolites significantly associated with breast cancer and correctly classifying breast cancer status are both very important tasks in metabolomics data analysis.
Objective: The main objective of this paper is to identify significant metabolites (p-value < 0.05) and a state-of-the-art classification technique for breast cancer prediction using a metabolomics dataset.
Methods: Although several techniques exist for identifying significant metabolites, here we used Student's t-test and the Kruskal-Wallis test. To predict breast cancer status, we considered five modern classification techniques: (i) Naive Bayes (NB), (ii) Support Vector Machine (SVM), (iii) Linear Discriminant Analysis (LDA), (iv) k-nearest neighbors (kNN), and (v) Random Forest (RF). We measured the performance of each classification technique through accuracy, sensitivity, specificity, the Receiver Operating Characteristic (ROC) curve, and the area under the ROC curve, among other measures.
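The performance measures named above can be illustrated with a minimal sketch. It assumes binary labels (1 = cancer, 0 = control) and a classifier that outputs a score per sample. The function names, the 0.5 decision threshold, and the toy data are illustrative assumptions and are not taken from the study.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def basic_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity


def roc_auc(y_true, scores):
    """Area under the ROC curve via its rank interpretation: the probability
    that a random positive scores higher than a random negative (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from the paper).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

accuracy, sensitivity, specificity = basic_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(accuracy, sensitivity, specificity, auc)
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, whereas the area under the ROC curve summarizes ranking quality across all thresholds, which is one reason the two kinds of measure are usually reported together.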
Results: The performance measures showed that the random forest classifier produced higher accuracy, sensitivity, specificity, and area under the ROC curve than the other classification techniques for breast cancer prediction using the metabolomics dataset. The analytical results also showed that 24 metabolites are significantly (adjusted p-value < 0.05) associated with breast cancer.
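The abstract reports adjusted p-values but does not name the adjustment procedure. As an illustration only, the sketch below assumes the Benjamini-Hochberg false discovery rate correction, a common choice when many metabolites are tested at once. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: BH is used here purely for illustration; the source does not
    state which multiple-testing adjustment was applied.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four metabolites.
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)

raw_hits = [i for i, p in enumerate(raw_p) if p < 0.05]
adj_hits = [i for i, p in enumerate(adj_p) if p < 0.05]
print(adj_p, raw_hits, adj_hits)
```

In this toy case three metabolites pass the raw 0.05 cutoff but only one survives adjustment, which shows why the count of significant metabolites depends on whether raw or adjusted p-values are used.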
Conclusion: On the basis of the experimental results, we conclude that 24 metabolites influence breast cancer and that, for breast cancer prediction as well as metabolomics data analysis, random forest is the state-of-the-art and outlier-robust classifier among the five classification techniques considered.
Keywords: Naive Bayes, support vector machine, linear discriminant analysis, k-nearest neighbors algorithm, random forest, ROC curve.
Graphical Abstract