Abstract
Background: Accurate classification of microarray data remains a major challenge in machine learning because the data are high-dimensional and the number of samples is small. Feature selection is an effective way to deal with such data.
Objective: To select a feature subset that maximizes both feature-feature diversity and feature-class relevance, thereby improving predictive efficiency and reducing the cost of feature acquisition, while restricting the selection of features that have high entropy but low classification performance.
Method: We first present a feature selection criterion based on an information distance measure, obtained by introducing a self-redundancy factor into the maximum relevance and minimum redundancy criterion; the self-redundancy factor acts as a penalty on features with high entropy. We then propose MFFID, an incremental-search feature selection method that uses this criterion to maximize the information distance between features.
Results: On twelve high-dimensional microarray datasets, the proposed MFFID method achieves higher classification accuracy than four representative feature selection methods.
Conclusion: This study proposes MFFID, a novel feature selection method whose criterion is expressed as an information distance measure, obtained by introducing the self-redundancy factor into CMRMR. The experimental results demonstrate that MFFID is an effective and stable feature selection method for classifying tumor datasets.
Keywords: Classification, feature selection, information distance measure, diversity, entropy, accuracy.
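Illustrative sketch: The Method paragraph describes an incremental search that trades off feature-class relevance, feature-feature information distance, and an entropy penalty. The abstract does not state the exact MFFID criterion, so the Python sketch below is only a hypothetical instance of that idea, not the authors' formula. It assumes discretized features, uses the variation of information, d(X, Y) = 2H(X, Y) - H(X) - H(Y), as the information distance, and the names `greedy_select`, `penalty_weight`, `X_toy`, and `y_toy` are invented for illustration.

```python
import numpy as np

def entropy(x):
    """Shannon entropy (in bits) of a discrete 1-D array."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def joint_entropy(x, y):
    """Joint Shannon entropy (in bits) of two discrete 1-D arrays."""
    pairs = np.stack([x, y], axis=1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - joint_entropy(x, y)

def information_distance(x, y):
    """Variation of information: H(X|Y) + H(Y|X) = 2H(X,Y) - H(X) - H(Y)."""
    return 2.0 * joint_entropy(x, y) - entropy(x) - entropy(y)

def greedy_select(X, y, k, penalty_weight=1.0):
    """Incrementally pick k feature indices (hypothetical scoring rule).

    score(f) = I(f; class) + mean distance to selected features
               - penalty_weight * H(f)
    The entropy term is an assumed stand-in for the self-redundancy
    penalty; it offsets the H(f) that the raw distance would reward.
    """
    n_features = X.shape[1]
    relevance = [mutual_information(X[:, j], y) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]  # start with the most relevant feature
    while len(selected) < min(k, n_features):
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            diversity = np.mean(
                [information_distance(X[:, j], X[:, s]) for s in selected]
            )
            score = relevance[j] + diversity - penalty_weight * entropy(X[:, j])
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

# Toy data: feature 1 duplicates feature 0, feature 2 is complementary,
# feature 3 is unrelated to the class.
y_toy = np.array([0, 0, 1, 1, 2, 2, 3, 3])
X_toy = np.array([
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]).T
print(greedy_select(X_toy, y_toy, 2))  # prints [0, 2]
```

On this toy data the duplicate feature has zero information distance to the first pick and is passed over in favor of the complementary feature. Real microarray expression values are continuous and would need discretization (or a different entropy estimator) before a sketch like this applies.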