Abstract
Background: The residue interaction network contains a large amount of three-dimensional spatial information about a protein, which is determined by its sequence characteristics. Studying protein thermostability from a network perspective is an effective approach.
Objective: We use residue interaction network information to improve the performance of machine learning methods trained to discriminate thermostable proteins from mesophilic proteins.
Method: We compared Support Vector Machines (SVM), BayesNet, Artificial Neural Network (ANN) and Logistic Regression (LR) and selected the best-performing method for distinguishing thermostable proteins from mesophilic ones.
Results: After combining the residue network topology parameters (average connection strength, average degree, characteristic path length, clustering coefficient, weighted clustering coefficient, closeness centrality and residue centrality) with sequence characteristics as feature vectors, we found that the SVM-based method performed best. Its average discrimination accuracy under five-fold cross-validation increased to 87.5% compared with the result using sequence characteristics alone as feature vectors. 89.71% of mesophilic proteins and 85.29% of thermophilic proteins were classified correctly.
Conclusion: We found that the characteristic path length and closeness centrality greatly improved the discrimination rate of thermophilic proteins. The main reason is that thermophilic proteins have more rigid, highly stable structures with strong interactions between residues, which causes them to have a shorter characteristic path length and closeness centrality. Residue network characteristics offer an innovative and reliable method for identifying and analyzing the factors related to protein thermostability.
Keywords: Residue interaction network, thermostability, Support Vector Machines, characteristic path length, closeness centrality, sequence characteristics.
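As an illustration of the unweighted topology parameters named in the Results (average degree, characteristic path length, clustering coefficient, closeness centrality), the following is a minimal sketch in plain Python. It is not the authors' pipeline: the toy five-residue graph, the function names and the use of unweighted contacts are all assumptions made for illustration. The weighted parameters (average connection strength, weighted clustering coefficient) and residue centrality are omitted because the abstract does not define them.

```python
from collections import deque

def bfs_distances(adj, source):
    """Unweighted shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_degree(adj):
    """Mean number of contacts per residue."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def characteristic_path_length(adj):
    """Mean shortest-path length over all connected pairs of distinct nodes."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs

def closeness_centrality(adj, node):
    """Number of other reachable nodes divided by the sum of distances to them."""
    dist = bfs_distances(adj, node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0

def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbours of node that are themselves in contact."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))

# Hypothetical toy network: nodes are residue indices, edges are contacts.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2)]
adj = {i: set() for i in range(5)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

# Per-protein topology features that could be appended to sequence features.
features = {
    "average_degree": average_degree(adj),
    "characteristic_path_length": characteristic_path_length(adj),
    "mean_clustering": sum(clustering_coefficient(adj, n) for n in adj) / len(adj),
    "mean_closeness": sum(closeness_centrality(adj, n) for n in adj) / len(adj),
}
print(features)
```

In a setting like the one described, a vector of this kind would be computed per protein and concatenated with the sequence-derived features before training the classifier. How contacts are defined and weighted is left open here.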