Abstract
Background: The classification of phenotypes from microarray data has drawn much attention in the last few years. Existing methods have mainly focused on selecting or constructing features based on either genes or gene pairs from continuous-value gene expression data. However, few studies have sought to identify useful features based on both genes and gene pairs from binary-value gene expression data.
Objective: In this work, we propose a new algorithm, called FSGGP, that selects both feature genes and feature gene pairs from binary-value gene expression data to improve two-phenotype classification.
Method: We calculated the uncertainty coefficient, which represents how well a phenotype is described by a gene or gene pair under a possible relationship, and used its value to identify the exact relationship between the gene or gene pair and the phenotype. We then calculated the closeness between genes or gene pairs and phenotypes, and selected the genes or gene pairs closely related to the phenotypes. The redundancy of genes and gene pairs as features was calculated by cross entropy on the binary data, and redundant feature genes or gene pairs were eliminated. The optimal feature sets were obtained by wrapper-based forward feature selection for three classical classifiers.
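As a rough illustration of the first step of the method, the sketch below computes the uncertainty coefficient U(Y|X) = (H(Y) - H(Y|X)) / H(Y) between a binary feature and a binary phenotype, and scores a gene pair under several candidate logical relationships. It is not the FSGGP implementation: the set of relationships (`RELATIONS`), the function names, and the rule of keeping the relationship with the largest coefficient are assumptions made for this example.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def uncertainty_coefficient(x, y):
    """U(y | x) = (H(y) - H(y | x)) / H(y); 0 means x says nothing about y."""
    h_y = entropy(y)
    if h_y == 0:
        return 0.0  # y is constant, so there is nothing to explain
    n = len(y)
    h_y_given_x = 0.0
    for x_value, count in Counter(x).items():
        subset = [yi for xi, yi in zip(x, y) if xi == x_value]
        h_y_given_x += (count / n) * entropy(subset)
    return (h_y - h_y_given_x) / h_y


# Hypothetical candidate relationships between two binary genes (assumption).
RELATIONS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def best_relation(g1, g2, phenotype):
    """Return the candidate relationship with the largest coefficient."""
    scores = {
        name: uncertainty_coefficient(
            [f(a, b) for a, b in zip(g1, g2)], phenotype
        )
        for name, f in RELATIONS.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Toy data: the phenotype equals g1 XOR g2, so neither gene alone helps.
    g1 = [0, 0, 1, 1]
    g2 = [0, 1, 0, 1]
    phenotype = [0, 1, 1, 0]
    print(uncertainty_coefficient(g1, phenotype))
    print(best_relation(g1, g2, phenotype))
```

In the toy data, neither gene is informative on its own, while the pair describes the phenotype fully under XOR. This is the kind of case in which selecting gene pairs in addition to single genes can help.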
Results: The algorithm was experimentally assessed on four public datasets. The results showed that FSGGP outperformed four known feature selection algorithms based on either genes or gene pairs in terms of average classification error rate.
Conclusion: We developed an algorithm to select both feature genes and feature gene pairs from binary-value gene expression data, in which feature gene pairs are selected by identifying the higher logical relationships between gene pairs and phenotypes. The comparison with four known feature selection algorithms suggests that feature selection based on both genes and gene pairs can achieve better performance than feature selection based on either genes or gene pairs alone, and that identifying higher logical relationships is an effective approach to selecting feature gene pairs.
Keywords: Classification, phenotype, gene, gene pair, microarray, binary-value gene expression data.
Graphical Abstract