Abstract
Correct QSAR analysis requires reliable measured or calculated logP values, because logP is the most frequently used and most important physicochemical parameter in such studies. Since the publication of the theoretical fundamentals of logP prediction, many commercial software solutions have become available. These programs all rely on experimental data from large databases; therefore, the predicted logP values are mostly acceptable, especially for known structures and their derivatives. In this study we critically reviewed the published methods and compared the predictive power of commercial software packages (CLOGP, KOWWIN, SciLogP / ULTRA) with each other and with our recently developed automatic QS(P)AR program. We selected a highly diverse set of 625 compounds, known drugs (98%) and drug-like molecules, with experimentally validated logP values. We also collected 78 reported “outliers”, which could not be predicted by the “traditional” methods. We used these data for model building and validation. Finally, we used an external validation set of compounds missing from public databases. We emphasized the importance of data quality and of descriptor calculation and selection, and presented a general, reliable descriptor selection and validation technique for such studies. Our method follows strict mathematical and statistical rules and is fully automatic: after the initial settings, there is no option for user intervention. Three approaches were applied: multiple linear regression, partial least squares analysis, and artificial neural networks. LogP predictions with a multiple linear regression model showed acceptable accuracy for new compounds; therefore, the model can be used for “in silico screening” and/or for planning virtual/combinatorial libraries.
Keywords: logP prediction, drugs, neural network, lipophilicity, in-silico-screening, combinatorial libraries