Abstract
The first stage of drug development is the search for hit (active) compounds in a pool of millions of compounds, and in silico (virtual) screening has been applied successfully to this task. One problem with in silico screening, however, is its low hit ratio relative to its high computational cost and long CPU time. The problem is most serious in structure-based in silico screening, mainly because the protein-compound binding free energy is estimated with low accuracy. The problem with ligand-based in silico screening is that the conventional quantitative structure-activity relationship (QSAR) approach is not effective at predicting new hit compounds with new scaffolds. Recently, machine-learning approaches have been applied to in silico drug screening to overcome these problems. Here we review machine-learning approaches for both structure-based and ligand-based drug screening. Machine learning is used to improve database enrichment in two ways: by improving the docking score calculated by a protein-compound docking program, and by calculating the optimal distance between the feature vectors of active and inactive compounds. Both approaches require compounds that are known to be active against the target protein. In structure-based screening, the former approach is mainly used with a protein-compound affinity matrix. In ligand-based screening, both approaches are used, and the latter can be applied to various kinds of descriptors, such as 1D/2D descriptors/fingerprints and the affinity fingerprint given by the protein-compound affinity matrix.
Keywords: Virtual screening, affinity fingerprint, machine learning, neural network model, support vector machine, decision tree, Bayesian model, self-organizing map
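
As an illustration of the distance-based idea mentioned in the abstract, the following minimal Python sketch treats each row of a protein-compound affinity matrix as the affinity fingerprint of one compound and ranks the remaining compounds by their distance to the nearest known active. It is only a sketch under stated assumptions: the function name rank_by_affinity_fingerprint, the column standardization, and the plain Euclidean distance are illustrative choices, not the learned (optimal) distance used by the reviewed methods, and the docking scores are made-up toy values.

```python
import numpy as np

def rank_by_affinity_fingerprint(affinity, active_idx):
    """Rank compounds by distance to the nearest known active (hypothetical helper).

    affinity   : (n_compounds, n_proteins) array of docking scores;
                 row i is the affinity fingerprint of compound i.
    active_idx : indices of compounds known to be active.
    Returns the indices of the remaining compounds, most similar first.
    """
    affinity = np.asarray(affinity, dtype=float)

    # Standardize each protein column so that no single protein dominates.
    std = affinity.std(axis=0)
    std[std == 0] = 1.0
    z = (affinity - affinity.mean(axis=0)) / std

    actives = z[active_idx]
    candidates = np.setdiff1d(np.arange(len(z)), active_idx)

    # Euclidean distance from every candidate to every known active
    # (assumption: a real method would learn this distance instead).
    d = np.linalg.norm(z[candidates, None, :] - actives[None, :, :], axis=2)
    nearest = d.min(axis=1)
    return candidates[np.argsort(nearest, kind="stable")]

# Toy, made-up docking scores (lower = stronger binding): 5 compounds x 3 proteins.
scores = np.array([
    [-9.0, -4.0, -5.0],   # compound 0: known active
    [-8.8, -4.1, -5.2],   # compound 1: profile close to compound 0
    [-5.0, -8.0, -4.0],   # compound 2
    [-4.0, -4.0, -9.0],   # compound 3
    [-8.5, -4.5, -5.0],   # compound 4: profile fairly close to compound 0
])
ranking = rank_by_affinity_fingerprint(scores, active_idx=[0])
print(ranking)
```

In this toy case the compounds whose binding profiles across the three proteins resemble that of the known active are placed at the top of the list, which is the sense in which such a ranking can improve database enrichment.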