Abstract
Background: Thermophilic proteins remain active at high temperatures, so studying them is important for understanding the thermal stability of proteins.
Objective: To address the low precision and low efficiency of thermophilic protein prediction, this paper proposes a prediction method based on feature fusion and machine learning.
Methods: For the selected thermophilic data sets, the protein sequences were first represented by fusing three feature types: g-gap dipeptide, entropy density, and autocorrelation coefficient. Kernel Principal Component Analysis (KPCA) was then used to reduce the dimensionality of the fused sequence features, which shortens training time and improves efficiency. Finally, a classification algorithm was used to build the prediction model.
Results: A variety of classification algorithms were trained and tested on the selected thermophilic dataset. In this comparison, the Support Vector Machine (SVM) achieved an accuracy of over 92% under the jackknife test. The other evaluation indicators also showed that the SVM performed best.
Conclusion: Owing to an effective feature representation method and a robust classifier, the proposed method is suitable for predicting thermophilic proteins and outperforms most previously reported methods.
Keywords: Thermophilic proteins, feature fusion, g-gap, entropy density, autocorrelation coefficient, KPCA, machine learning.
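As an illustration of the first feature type named in the Methods, the following is a minimal sketch of a g-gap dipeptide descriptor, which counts pairs of residues separated by g positions. The function name, the 20-letter amino acid alphabet, the skipping of non-standard residues, and the normalization by the number of valid pairs are assumptions made for this example; the abstract does not specify the exact formulation used in the paper.

```python
from itertools import product

# The 20 standard amino acids (assumed alphabet for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 400 ordered residue pairs, in a fixed order that defines the vector layout.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def g_gap_dipeptide(sequence, g=0):
    """Return a 400-dimensional frequency vector of g-gap dipeptides.

    A g-gap dipeptide is a pair of residues separated by g intervening
    positions; g=0 gives the ordinary adjacent dipeptide composition.
    Pairs containing non-standard residues are skipped (an assumption).
    """
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for i in range(len(sequence) - g - 1):
        pair = sequence[i] + sequence[i + g + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short (or no valid pairs): return an all-zero vector.
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]


if __name__ == "__main__":
    features = g_gap_dipeptide("ACACAC", g=1)
    print(len(features), features[PAIRS.index("AA")])
```

In a feature-fusion setting, a vector like this would be concatenated with the other descriptors before dimensionality reduction and classification.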