Abstract
Predicting the thermostability of a biomolecule from its sequence is one of the major challenges of protein engineering, and tools for screening thermostable mutants are of great interest. Here we used various screening, clustering, decision tree and generalized rule induction models to search for patterns of thermostability. Arg was found as the N-terminal amino acid only in proteins at temperatures higher than 70°C. Feature selection identified 54 protein features as important, and the number of peer groups (anomaly index 2.12) declined from 7 to 2 when only the selected features were used; K-Means and TwoStep clusters did not change with or without feature selection filtering. Tree depths of the decision tree models ranged from 4 branches (CHAID) to 14 branches (C5.0 with 10-fold cross-validation and feature selection); C5.0 was the best model and Quest the worst. Feature selection made no significant difference to the performance of the decision tree models, but it significantly reduced the number of peer groups in clustering models (p < 0.05). The frequency of Gln was the most important feature in the decision tree rules, and it appeared in the antecedent of all association rules. The importance of Gln in protein thermostability is discussed in this paper.
Keywords: Bioinformatics, modeling, protein, thermostability