Abstract
Protein solubility plays a major role in understanding protein crystal growth and crystallization. Predicting whether a protein will be soluble or will form inclusion bodies is a long-standing problem that remains unresolved. We selected almost 10,000 protein sequences from the NCBI database and used CD-HIT to remove sequences at the 90% sequence similarity level, which left 5692 sequences. Using Chou's pseudo amino acid composition features, we predict protein solubility with three methods: a support vector machine (SVM), a back propagation neural network (BP Neural Network), and a hybrid method based on SVM and BP Neural Network. Each method is evaluated with the re-substitution test and the 10-fold cross-validation test. In the re-substitution test, the BP Neural Network gives the best results, with an accuracy of 92.88% and a Matthews Correlation Coefficient (MCC) of 0.8513. In the 10-fold cross-validation test, however, the other two methods outperform the BP Neural Network, and the hybrid method based on SVM and BP Neural Network performs best, with an average accuracy of 86.78% and an average MCC of 0.7233. Although all three methods perform well, the performance comparison indicates that the hybrid method is the best.
Keywords: Amino acid composition, neural network, hybrid approach, prediction, protein solubility, support vector machine, NCBI database, Chou's pseudo amino acid, CD-HIT, back propagation neural network, hybrid method, Matthews Correlation Coefficient, Escherichia coli, Arg residues, cysteine fraction, proline fraction, GalNAc-transferase, serine hydrolases, human papillomaviruses, DNA-binding proteins, Isoleucine, Leucine, Valine, Methionine, Arginine, Lysine, Aspartic acid, Glutamic acid, Asparagine, Glutamine, Histidine, Serine, Threonine, Proline, Alanine, Glycine, Cysteine, Phenylalanine, Artificial Neural Network, jackknife test, cross validation test
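The two evaluation measures reported in the abstract, accuracy and MCC, can both be computed from the counts of a binary confusion matrix (soluble versus inclusion body). The following is a minimal sketch using the standard definitions of these measures, not the authors' evaluation code; the function names and the example counts are hypothetical and chosen only for illustration.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention
    for the case where an entire row or column of the matrix is empty).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, not taken from the paper:
# tp = soluble predicted soluble, tn = insoluble predicted insoluble.
example = dict(tp=50, tn=40, fp=5, fn=5)
print(f"accuracy = {accuracy(**example):.4f}")
print(f"MCC      = {mcc(**example):.4f}")
```

Under 10-fold cross-validation, these two values would be computed once per fold and then averaged, which is how average figures such as those quoted above are typically obtained.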