Abstract
In this paper, we propose a strategy for predicting the subcellular locations of human proteins using multi-step feature selection. Each protein is first encoded by features derived from KEGG and GO enrichment scores. After an initial feature reduction, 9958 features remain and are ranked by the Minimum Redundancy Maximum Relevance (mRMR) method. The ranked features are then filtered by an incremental feature selection (IFS) procedure to obtain a compact feature set. Random forest (RF) is used as the prediction model and achieves an overall prediction accuracy of 67.72%, as evaluated by ten-fold cross-validation. The KEGG pathways and GO terms corresponding to the selected features are analyzed in depth and are considered the terms most relevant to human protein subcellular location.
Keywords: Subcellular location, minimum redundancy maximum relevance, incremental feature selection, random forest algorithm, ten-fold cross-validation.
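
The IFS step summarized in the abstract can be illustrated with a minimal sketch. It assumes that the features have already been ranked (for example, by mRMR) and that the ranking is passed in as a list of column indices. To keep the example self-contained, a nearest-centroid classifier stands in for the random forest used in the paper, and the function names and toy data (`nearest_centroid_predict`, `kfold_accuracy`, `incremental_feature_selection`, `make_toy_data`) are hypothetical and not taken from the original work.

```python
import random
from collections import defaultdict


def nearest_centroid_predict(train_X, train_y, test_X):
    """Stand-in classifier (NOT random forest): pick the closest class mean."""
    sums = {}
    counts = defaultdict(int)
    for x, label in zip(train_X, train_y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
        for j, v in enumerate(x):
            sums[label][j] += v
        counts[label] += 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
    preds = []
    for x in test_X:
        best = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )
        preds.append(best)
    return preds


def kfold_accuracy(X, y, k=10, seed=0):
    """Overall accuracy pooled over k cross-validation folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        preds = nearest_centroid_predict(
            [X[i] for i in train], [y[i] for i in train], [X[i] for i in fold]
        )
        correct += sum(p == y[i] for p, i in zip(preds, fold))
    return correct / len(X)


def incremental_feature_selection(X, y, ranked_features, k=10):
    """Evaluate the top-1, top-2, ... ranked features; keep the best prefix."""
    best_acc, best_n = -1.0, 0
    curve = []  # accuracy for each prefix length, i.e. the IFS curve
    for n in range(1, len(ranked_features) + 1):
        cols = ranked_features[:n]
        X_sub = [[row[c] for c in cols] for row in X]
        acc = kfold_accuracy(X_sub, y, k=k)
        curve.append(acc)
        if acc > best_acc:  # strict '>' keeps the smallest best prefix
            best_acc, best_n = acc, n
    return ranked_features[:best_n], best_acc, curve


def make_toy_data(n_per_class=20, seed=1):
    """Hypothetical data: feature 0 separates the classes, 1 and 2 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([10 * label + rng.random(), rng.random(), rng.random()])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    # Assumed ranking, as if produced by a prior mRMR step.
    selected, acc, curve = incremental_feature_selection(X, y, [0, 2, 1])
    print(selected, acc, curve)
```

In this sketch, each prefix of the ranked list is scored by ten-fold cross-validation, and the shortest prefix with the highest accuracy is retained as the compact feature set. Replacing the stand-in classifier with a random forest would bring the sketch closer to the setup described above.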