Abstract
The growing number of fully sequenced genomes demands functional interpretation of the genomic information. Despite high-throughput experimental techniques and in silico methods for predicting protein-protein interactions (PPIs), the interactome of most organisms is far from complete. Thus, predicting the interactome of an organism is one of the major challenges of the post-genomic era. This manuscript describes Support Vector Machine (SVM) based models developed to discriminate interacting from non-interacting pairs of proteins using their amino acid sequences. We developed SVM models using various types of sequence composition, e.g. amino acid, dipeptide, biochemical property, split amino acid and pseudo amino acid composition. We also developed SVM models using evolutionary information in the form of Position Specific Scoring Matrix (PSSM) composition. We achieved maximum Matthews correlation coefficients (MCC) of 1.00, 0.52 and 0.74 for Escherichia coli, Saccharomyces cerevisiae and Helicobacter pylori, respectively, using the dipeptide-based SVM model at the default threshold. The performance of a prediction model was observed to depend on the dataset used for training and testing: for E. coli, the MCC decreased from 1.0 to 0.67 when the model was evaluated on a new dataset. To understand PPIs in different cellular environments, we developed species-specific and general models, and observed that species-specific models are more accurate than general models. We conclude that descriptors based on the primary amino acid sequence can be used to differentiate interacting from non-interacting protein pairs. Some amino acids tend to be more favored in interacting pairs than in non-interacting ones. Finally, a web server has been developed for predicting protein-protein interactions.
Keywords: Protein interaction, protein sequence, support vector machine, interactome, protein interaction prediction, Position Specific Scoring Matrix (PSSM), Matthews correlation coefficient (MCC), Escherichia coli, Saccharomyces cerevisiae, Helicobacter pylori, co-expression data analysis, pull-down assays, coimmunoprecipitation, tandem affinity purification, two-hybrid-based methods, Mass spectrometry, protein chips, binding reaction methods, hybrid approaches, phylogenetic profile, conservation of gene neighborhood, gene fusion, correlated mutations, signature product method, pairwise kernel methods, Genome context methods, predicted operons, random forest method, Bayesian network, split amino acid, Pseudo-amino acid based SVM model, converted sequence alphabet, biochemical monopeptide, dipeptide (BD) and tripeptide, Non-Redundant Proteins in Interacting Pairs, Cross-Species Prediction, Web-based Prediction Server, ProPrint, Wrapper Based Attribute Selection
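As an illustration of two quantities named in the abstract, the following minimal Python sketch computes a dipeptide composition vector for a protein pair and the Matthews correlation coefficient from confusion-matrix counts. It is not the authors' implementation. The function names, the expression of composition as percentages, and the encoding of a pair by concatenating the two 400-dimensional vectors are assumptions made here for illustration only.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return the 400-dimensional dipeptide composition of a sequence, in percent.

    Dipeptides containing non-standard residues are skipped (an assumption).
    """
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        dipeptide = seq[i:i + 2]
        if dipeptide in counts:
            counts[dipeptide] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [100.0 * counts[dp] / total for dp in DIPEPTIDES]


def pair_features(seq_a, seq_b):
    """Encode a protein pair by concatenating both composition vectors.

    Concatenation is an assumed encoding, chosen here for simplicity.
    """
    return dipeptide_composition(seq_a) + dipeptide_composition(seq_b)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a marginal count is zero
    return (tp * tn - fp * fn) / denom
```

The resulting 800-dimensional vectors could then be passed to any SVM implementation as input features, with interacting pairs labeled positive and non-interacting pairs labeled negative.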