Abstract
In systems biology, determining whether a given set of organic compounds can combine to form a meaningful pathway remains a major challenge. The rapid accumulation of information on various organisms is making this problem increasingly tractable. Drawing on the available information, we propose a novel computational approach to this problem, using the metabolic pathways of yeast as the subject of study. We constructed a benchmark dataset of 13,736 pathways, comprising both valid and invalid pathways, and identified the valid pathways among them. Each pathway was encoded as a numeric vector consisting of three parts: graph property, chemical functional group, and chemical structural set. Minimum Redundancy Maximum Relevance and Incremental Feature Selection were used to select an optimal feature set, the Nearest Neighbor Algorithm served as the classification model, and the jackknife test was used to evaluate the model. As a result, an optimal set of 16 features that identified the valid pathways most successfully was obtained.
Keywords: Chemical functional group, chemical structural set, compound similarity, metabolic pathway, minimum redundancy maximum relevance, nearest neighbor algorithm, jackknife cross-validation, encoding methods, metabolism
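
The selection-and-evaluation loop summarized in the abstract can be illustrated with a minimal sketch. It assumes that a feature ranking (for example, one produced by Minimum Redundancy Maximum Relevance) is already available, and it uses Euclidean distance for the nearest neighbor step, which is an assumption made here and may differ from the distance used in the study. The function names and the toy data are hypothetical and serve only as illustration.

```python
import math

def nn_predict(train_X, train_y, query):
    """Return the label of the training vector closest to `query`.

    Euclidean distance is an assumption of this sketch.
    """
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return train_y[best]

def jackknife_accuracy(X, y, features):
    """Leave-one-out (jackknife) accuracy of 1-NN on the chosen feature indices."""
    proj = [[row[f] for f in features] for row in X]
    correct = 0
    for i in range(len(proj)):
        # Hold out sample i and classify it using all remaining samples.
        train_X = proj[:i] + proj[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nn_predict(train_X, train_y, proj[i]) == y[i]:
            correct += 1
    return correct / len(proj)

def incremental_feature_selection(X, y, ranked):
    """Grow the feature set along `ranked` and keep the best-scoring prefix.

    `ranked` is a list of feature indices, most important first
    (assumed to come from a ranking step such as mRMR).
    Ties are resolved in favor of the smaller feature set.
    """
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked) + 1):
        acc = jackknife_accuracy(X, y, ranked[:k])
        if acc > best_acc:
            best_k, best_acc = k, acc
    return ranked[:best_k], best_acc

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, -5.0], [1.0, 5.1], [1.1, -5.1]]
y = [0, 0, 1, 1]
selected, accuracy = incremental_feature_selection(X, y, ranked=[0, 1])
print(selected, accuracy)
```

In this toy case the loop keeps only the first ranked feature, because adding the noisy second feature lowers the jackknife accuracy.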