Abstract
The magnitude of the challenges in preclinical drug discovery is evident in the large amount of capital invested in such efforts in pursuit of a small, static number of marketable therapeutics that eventually succeed. An explosion in the availability of potentially drug-like compounds, and of chemical biology data on these molecules, can provide the means to improve eventual success rates for compounds being considered at the preclinical level, but only if the community is able to access the available information in an efficient and meaningful way. Thus, chemical database resources are critical to any serious drug discovery effort. This paper explores the basic principles underlying the development and implementation of chemical databases, and examines key issues of how molecular information may be encoded within these databases to enhance the likelihood that users can extract meaningful information from data queries. In addition to a broad survey of conventional data representation and query strategies, key enabling technologies such as new context-sensitive chemical similarity measures and chemical cartridges are examined, with recommendations on how such resources may be integrated into a practical database environment.
Keywords: Chemical cartridge, chemical database, fingerprints, frequent subgraphs, hashing, molecular structure representation, similarity.