Abstract
With the explosive growth of protein sequence databanks, there is an increasing need to annotate the many newly discovered enzyme sequences. Given a protein sequence, two questions arise: is it an enzyme or a non-enzyme, and, if it is an enzyme, to which main functional class does it belong? Because experimental methods are both time-consuming and expensive, it is highly desirable to develop an in silico method to address these problems. In this paper, two effective feature extraction methods are combined to build a 2-layer predictor. The 1st-layer prediction engine determines whether a query protein is an enzyme or a non-enzyme, using either 188-D features based on the composition and physicochemical properties of the protein or 20-D features derived from the position-specific scoring matrix (PSSM). The 2nd-layer prediction engine uses the 20-D PSSM features to predict the main family class of the enzyme. In our experiments, multifunctional enzymes are treated as the 7th enzyme category because of their specific characteristics. The 1st-layer prediction reaches accuracies of 98.99% (188-D) and 98.25% (20-D) under 10-fold cross-validation, and the 2nd-layer prediction reaches 97.12% with Random Forest and 98.39% with IB1. These high accuracies indicate that the method could be an effective and promising high-throughput tool for enzyme research. Furthermore, we developed an online web server, which can be accessed at http://datamining.xmu.edu.cn:8080/PredictE/.
Keywords: Bioinformatics, enzyme family class, machine learning, multi-functional enzyme.
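The control flow of the 2-layer predictor can be illustrated with a minimal Python sketch. It is not the implementation behind the web server. Several details are assumptions made only for illustration: the 20-D feature is taken to be the per-column average of an L x 20 PSSM, a plain 1-nearest-neighbour rule with Euclidean distance stands in for IB1, and the function names `pssm_to_20d`, `nearest_neighbour`, and `predict_two_layer` are hypothetical.

```python
from math import dist


def pssm_to_20d(pssm):
    """Reduce an L x 20 PSSM to 20 values by averaging each column.

    Assumption: the abstract does not state how the 20-D feature is
    derived from the PSSM; column averaging is one common choice.
    """
    if not pssm:
        raise ValueError("PSSM must contain at least one row")
    length = len(pssm)
    return [sum(row[j] for row in pssm) / length for j in range(20)]


def nearest_neighbour(train, query):
    """Return the label of the training vector closest to the query.

    `train` is a list of (feature_vector, label) pairs. This is a plain
    1-nearest-neighbour rule, used here as a stand-in for IB1.
    """
    return min(train, key=lambda item: dist(item[0], query))[1]


def predict_two_layer(layer1_train, layer2_train, query):
    """Layer 1 decides enzyme vs non-enzyme; layer 2 runs only for enzymes.

    Layer 2 labels are the integers 1 to 7, where 7 denotes the
    multifunctional category.
    """
    if nearest_neighbour(layer1_train, query) != "enzyme":
        return ("non-enzyme", None)
    return ("enzyme", nearest_neighbour(layer2_train, query))
```

In this sketch a query protein reaches the 2nd layer only if the 1st layer labels it an enzyme, so the family class is never predicted for a non-enzyme.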