Abstract
Hepatocellular carcinoma (HCC) is the most common type of liver cancer worldwide and occurs mostly in areas where viral hepatitis is endemic, such as China. Knowledge of HCC-related genes may enable earlier detection of HCC and the development of molecularly targeted therapeutics, reducing mortality and significantly improving patient prognosis. It is therefore valuable to identify the common characteristics of HCC-related genes. In this study, we propose a computational method that predicts HCC-related genes from Gene Ontology (GO) terms and KEGG terms using Random Forest (RF), with features optimized by maximum relevance minimum redundancy (mRMR) and incremental feature selection (IFS). A total of 224 HCC gene candidates were compiled from several databases, while 11,200 non-HCC gene candidates were randomly selected from the Ensembl database. Ten candidate datasets were constructed by dividing the non-HCC gene candidates into 10 groups. Each gene in the datasets was encoded by 13,126 features, comprising 12,887 GO enrichment scores and 239 KEGG enrichment scores. Finally, an optimal feature set of 615 GO terms and 11 KEGG pathways was identified. Our analysis found that these features are closely related to HCC, which suggests that the method is effective for discovering HCC-related genes and may also be applicable to predicting and analyzing genes for other types of cancer.
Keywords: Gene ontology, hepatocellular carcinoma (HCC), incremental feature selection (IFS), KEGG, maximum relevance minimum redundancy (mRMR), random forest (RF).
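The dataset construction and the IFS step summarized in the abstract can be illustrated with a minimal Python sketch. It is not the implementation used in the study. The function names `build_candidate_datasets` and `incremental_feature_selection` are hypothetical, the pairing of every non-HCC group with all 224 HCC candidates is an assumption, and the toy `evaluate` function merely stands in for a classifier-based score such as cross-validated RF performance. The feature ranking, which mRMR would supply, is taken as given.

```python
import random
from typing import Callable, List, Sequence, Tuple


def build_candidate_datasets(
    positives: Sequence[str],
    negatives: Sequence[str],
    n_groups: int = 10,
    seed: int = 0,
) -> List[Tuple[List[str], List[str]]]:
    """Shuffle the negatives, split them into n_groups disjoint groups,
    and pair each group with the full positive set (an assumption)."""
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    # Striding by n_groups gives groups whose sizes differ by at most one.
    groups = [shuffled[i::n_groups] for i in range(n_groups)]
    return [(list(positives), group) for group in groups]


def incremental_feature_selection(
    ranked_features: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Score the nested subsets top-1, top-2, ... of a ranked feature list
    and return the best-scoring subset (the smallest one on ties)."""
    best_subset: List[str] = []
    best_score = float("-inf")
    for k in range(1, len(ranked_features) + 1):
        subset = list(ranked_features[:k])
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Placeholder identifiers with the counts reported in the abstract.
hcc_genes = [f"HCC_{i}" for i in range(224)]
non_hcc_genes = [f"NEG_{i}" for i in range(11200)]
datasets = build_candidate_datasets(hcc_genes, non_hcc_genes)

# Toy score: reward two "informative" features, penalize subset size.
# In practice this would be replaced by a classifier performance measure.
informative = {"f1", "f2"}


def toy_evaluate(subset: List[str]) -> float:
    return len(informative & set(subset)) - 0.1 * len(subset)


optimal, optimal_score = incremental_feature_selection(
    ["f1", "f2", "f3", "f4"], toy_evaluate
)
```

With the counts given in the abstract, each of the 10 datasets in this sketch holds 224 HCC candidates and 1,120 non-HCC candidates. The IFS loop keeps the smallest prefix of the ranked list that achieves the highest score, which is one reasonable tie-breaking choice among several.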