Abstract
Advances in sequencing techniques put personalized genomic medicine on the horizon, and with it the responsibility of clinicians to understand the likelihood that a mutation causes disease, and of scientists to separate etiology from nonpathologic variability. Pathogenicity is discernible from patterns of interactions between a missense mutation, the surrounding protein structure, and intermolecular interactions. Physicochemical stability calculations require structures, which are unavailable for the vast majority of human proteins, so diagnostic accuracy remains in its infancy. To model the effects of missense mutations on functional stability without structure, we combine novel protein sequence analysis algorithms that discern spatial distributions of sequence, evolutionary, and physicochemical conservation, using a new approach to optimize component selection. Novel components include a combinatory substitution matrix and two heuristic algorithms that detect positions conferring structural support to interaction interfaces. The method reaches 0.91 AUC in ten-fold cross-validation when predicting alteration of function for 6,392 in vitro mutations. For clinical utility, we trained the method on 7,022 disease-associated missense mutations from Online Mendelian Inheritance in Man (OMIM) amongst a larger randomized set. In a blinded prospective test to delineate mutations unique to 186 patients with craniosynostosis from those in the 95 highly variant Coriell controls and 2,000 control chromosomes, we achieved roughly one-third sensitivity and perfect specificity. The component algorithms retained during machine learning constitute novel protein sequence analysis techniques that describe environments supporting neutrality or pathology of mutations. This approach to pathogenetics enables new insight into the mechanistic relationship of missense mutations to disease phenotypes in our patients.
Keywords: Computational biology, protein stability, machine learning, missense mutation, nonsynonymous SNP, sequence analysis, sequence analysis algorithms, discern spatial distributions of sequence, novel protein, physicochemical conservation, heuristic algorithms, Amino Acid Substitution Matrices, Structural Analysis, Sequence Inferred Structural Analysis, SSR, The multivariate analysis of protein polymorphisms algorithm (MAPP), Str from Seq, Shells - [Novel], Fxn from Str, Sequence Independent Algorithms, Matrices - [Novel], Logistic Regression, Reverse Stepwise Logistic Regression, Support Vector Machine, Craniosynostosis Data Sets, Prospective Clinical Test, Retrospective Clinical Test, Ten Fold Cross Validation, The receiver operator characteristic (ROC), AUC, Two State Accuracy, Amino Acid Substitution Scoring Matrix, LacRepressor, Nonsynonymous SNP Amino Acid Substitution Scoring Matrix
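
The evaluation measures named in the abstract (AUC, sensitivity, specificity) can be illustrated with a minimal sketch. The function names `roc_auc` and `sensitivity_specificity`, the scores, the labels, and the threshold below are all hypothetical and chosen for illustration only; they are not the study's data or implementation. The toy values are picked so that a strict threshold yields low sensitivity with perfect specificity, the same qualitative trade-off reported for the prospective test.

```python
from typing import Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(
    labels: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Call a mutation 'pathogenic' when its score is >= threshold, then
    return (sensitivity, specificity) against the true labels."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: 1 = pathogenic, 0 = neutral (not from the study).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.5, 0.2, 0.1]

auc = roc_auc(labels, scores)
sens, spec = sensitivity_specificity(labels, scores, threshold=0.8)
print(f"AUC={auc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

AUC summarizes ranking quality across all thresholds, whereas sensitivity and specificity describe one operating point; a classifier with a high AUC can therefore still be run at a conservative threshold that trades sensitivity for specificity.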