Abstract
Prediction of thermophilic and mesophilic proteins plays a crucial role in both biochemistry and bioengineering. In this study, a different mode of pseudo amino acid composition (PseAAC) was proposed to formulate the protein samples by integrating the amino acid composition, the physicochemical features, and the composition, transition, and distribution features, so that each protein sample was represented by a numerical vector through a sequence-based approach. Using the support vector machine algorithm, an accurate and reliable classifier was constructed to predict thermophilic and mesophilic proteins. Moreover, three feature reduction algorithms were applied to locate the most vital features and reduce the size of the feature space. Among these three algorithms, the genetic algorithm performed best. Finally, with the reduced feature set selected by the genetic algorithm, the new classifier achieved an accuracy of 95.93% and a Matthews correlation coefficient of 0.9187 on the selected dataset.
Keywords: Amino acid composition, classifier, feature reduction, genetic algorithm, support vector machine
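For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of two of them: the 20-dimensional amino acid composition of a sequence, and the accuracy and Matthews correlation coefficient computed from a binary confusion matrix. It is an illustration only and not the implementation used in this study; the function names and the example sequence are hypothetical, and the physicochemical and composition, transition, and distribution features are not reproduced here.

```python
import math

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the relative frequency of each standard residue (20 values).

    Non-standard characters are ignored. This is only the plain composition
    component, not the full feature vector described in the abstract.
    """
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator vanishes (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical toy sequence, chosen only to show the vector layout.
    vector = amino_acid_composition("AACD")
    print(dict(zip(AMINO_ACIDS, vector)))
    # Hypothetical confusion-matrix counts, not results from the study.
    print(accuracy(45, 45, 5, 5), matthews_corrcoef(45, 45, 5, 5))
```

In a pipeline of the kind described above, a vector such as this would form one part of the input to the support vector machine, and the two metrics would be evaluated on held-out predictions.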