Abstract
The past decade has seen rapid growth in biomedical data from many fields, including genetics, chemistry, pharmacology and medicine. Structured data integration within and among these fields exists to varying degrees, but unstructured data integration has existed since the dawn of science in the form of text-based published reports. The biomedical literature is vast: Medline holds approximately 15 million records and adds new ones at a rate greater than one per minute. Medline contains a wealth of information about chemical compounds, interactions, side effects, phenotypes, genetic interactions and disease studies. Computational methods are being designed to mine these large bodies of unstructured text and to infer what is not yet known from what is already known. Applied to drug discovery, this approach has become a potential means of shortcutting the traditional drug discovery “pipeline”, which has been estimated to take up to 15 years and cost approximately US$1 billion from target selection to FDA approval. These literature-based methods of knowledge discovery provide a means to identify candidate compounds to treat diseases and to identify genes that may play a role in rare but extremely adverse reactions to promising new drugs, reactions that subsequently force removal of those drugs from the pipeline. This chapter discusses the use of literature-based sources of knowledge as a means of discovering novel connections between pharmacological entities such as diseases, drugs and genes.