Abstract
Objective: Small GTPases are important molecular switches that participate in numerous signal transduction pathways. This study aimed to explore features for their binary classification using machine learning algorithms.
Methods: Sequences of small GTPases and non-small GTPases were clustered separately to remove similar entries and then divided into 10 datasets, each containing equal numbers of small GTPases and non-small GTPases. Three feature vectors were extracted from these datasets: 188-dimensional (188D), 400D, and motif-based (608D) features. Classification was then performed with the easy-classify.py software, built on scikit-learn, which integrates 12 classifiers. Finally, conserved motifs were discovered with the MEME suite.
Results: The three best-performing classifiers were logistic regression (LR), gradient boosting decision tree (GBDT), and bagging for the 400D features; LibSVM, GBDT, and bagging for the 188D features; and GBDT, bagging, and AdaBoost for the 608D features. Taking the commonly used evaluation indices as a whole, the top four classifiers were GBDT, bagging, LR, and AdaBoost. GBDT achieved the highest area under the curve (AUC), at 88.61%. The 400D features performed better than the 188D and 608D features. Five conserved G-box motifs were discovered in the sequences of human small GTPases.
Conclusion: This study provides the first report that the GBDT algorithm performs best for small GTPase classification.
Keywords: Small GTPase, binary-class classification, feature vector, gradient boosting decision tree (GBDT), scikit-learn method, motif.
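Note on the AUC metric: the AUC reported in the Results can be read as the probability that a randomly chosen positive entry (here, a small GTPase) receives a higher classifier score than a randomly chosen negative entry. The following is a minimal, dependency-free sketch of that pairwise definition. The function name `auc_from_scores` and the toy labels and scores are illustrative assumptions; they are not taken from easy-classify.py or from the study's data.

```python
def auc_from_scores(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs in which the
    positive entry has the higher score; tied pairs count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative entry")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical toy data: 1 = small GTPase, 0 = non-small GTPase.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
example_auc = auc_from_scores(labels, scores)
print(f"AUC = {example_auc:.4f}")
```

In this toy case, 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9. The double loop is quadratic in the number of entries; it is written for clarity, and a rank-based computation would be preferable for large datasets.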
Graphical Abstract