Abstract
High throughput screening (HTS) remains a very costly process despite many recent technological advances in biotechnology. In this study we consider the application of machine learning methods to predicting experimental HTS measurements. Such a virtual HTS analysis can be based on the results of real HTS campaigns carried out with similar compound libraries and similar drug targets. To this end, we analyzed the Test assay from the McMaster University Data Mining and Docking Competition [1] using binary decision trees, neural networks, support vector machines (SVM), linear discriminant analysis, k-nearest neighbors and partial least squares. First, we studied the sets of molecular and atomic descriptors separately in order to establish which of them provides better predictions. Then, we compared the six machine learning methods in terms of false positives, false negatives, sensitivity and enrichment factor. Finally, a variable selection procedure that improves the sensitivity of the methods was implemented and applied in the framework of polynomial SVM.
Keywords: CART, decision trees, drug target, hit, k-nearest neighbors (kNN), linear discriminant analysis (LDA), neural networks (NN), partial least squares (PLS), ROC curve, sampling, support vector machines (SVM), virtual high throughput screening