Abstract
Nuclear receptors are involved in multiple cellular signaling pathways that affect and regulate many biological processes. Because of their physiological and pathophysiological significance, classifying nuclear receptors is essential for a proper understanding of their functions. Bhasin and Raghava have shown that the subfamilies of nuclear receptors are closely correlated with their amino acid composition and dipeptide composition [29]. They characterized each protein by a 400-dimensional feature vector. However, using high-dimensional feature vectors to characterize protein sequences increases both the computational cost and the risk of overfitting. Therefore, using only those features that are most relevant to the task might improve the prediction system and might also provide biologically useful knowledge. In this paper, a feature selection approach was proposed to identify relevant features, and a prediction engine based on support vector machines was developed to estimate the classification accuracy obtained with the selected features. A reduced subset of 30 features, of which 18 are amino acid compositions and 12 are dipeptide compositions, was accepted to characterize the protein sequences because of its good discriminative power among the classes. This reduced feature subset gave an overall accuracy of 98.9% in a 5-fold cross-validation test, higher than the 88.7% of the amino acid composition based method and almost as high as the 99.3% of the dipeptide composition based method. Moreover, it reached an overall accuracy of 93.7% when evaluated on a blind data set of 63 nuclear receptors. In addition, using only the 12 selected dipeptide compositions gave overall accuracies of 96.1% and 95.2% in the 5-fold cross-validation test and the blind data set test, respectively. These results demonstrate the effectiveness of the present method.
Keywords: Nuclear receptor, feature selection, protein function, support vector machine, bioinformatics
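
To make the sequence representations mentioned in the abstract concrete, the following minimal Python sketch computes a 20-dimensional amino acid composition and a 400-dimensional dipeptide composition for a single sequence. The function names, the alphabetical ordering of the features, the handling of non-standard residues, and the normalization by the number of counted residues or residue pairs are assumptions made for illustration; they are not necessarily the exact definitions used in this work or in [29].

```python
from itertools import product

# The 20 standard amino acids in alphabetical one-letter order (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid (20 values)."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    total = len(residues)
    return [residues.count(aa) / total for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return the fraction of each ordered residue pair (400 values)."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    # Slide a window of width 2 over the sequence; pairs that contain
    # a non-standard symbol are skipped (an assumption of this sketch).
    for first, second in zip(seq, seq[1:]):
        pair = first + second
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[dp] / total for dp in DIPEPTIDES]
```

A reduced representation such as the 30-feature subset described above would then correspond to keeping only the selected positions of these two vectors; the specific positions are not listed here because the abstract does not name them.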