Abstract
A novel method is presented for predicting β-hairpin motifs in protein sequences. That is Random Forest algorithm on the basis of the multi-characteristic parameters, which include amino acids component of position, hydropathy component of position, predicted secondary structure information and value of auto-correlation function. Firstly, the method is trained and tested on a set of 8,291 β-hairpin motifs and 6,865 non-β-hairpin motifs. The overall accuracy and Matthews correlation coefficient achieve 82.2% and 0.64 using 5-fold cross-validation, while they achieve 81.7% and 0.63 using the independent test. Secondly, the method is also tested on a set of 4,884 β-hairpin motifs and 4,310 non-β- hairpin motifs which is used in previous studies. The overall accuracy and Matthews correlation coefficient achieve 80.9% and 0.61 for 5-fold cross-validation, while they achieve 80.6% and 0.60 for the independent test. Compared with the previous, the present result is better. Thirdly, 4,884 β-hairpin motifs and 4,310 non-β-hairpin motifs selected as the training set, and 8,291 β-hairpin motifs and 6,865 non-β-hairpin motifs selected as the independent testing set, the overall accuracy and Matthews correlation coefficient achieve 81.5% and 0.63 with the independent test.
Keywords: Amino acids component of position, auto-correlation function, β-hairpin motif, hydropathy component of position, predicted secondary structure information, random forest algorithmAmino acids component of position, auto-correlation function, β-hairpin motif, hydropathy component of position, predicted secondary structure information, random forest algorithm