Abstract
Ortholog clusters are essential for functional annotation and for studies in comparative and evolutionary genomics, so their accuracy is of considerable significance. Measuring that accuracy is difficult, however, because comparing every gene between two sets of ortholog clusters takes too long when the sets contain many clusters and the search space is large. This study presents a fast comparison algorithm that measures the accuracy of a set of predicted ortholog clusters (POCs) against a manually curated standard set of reliable ortholog clusters (ROCs). In the first step, the method identifies the groups of POCs and ROCs that share genes: it recursively searches for and merges every element with a common ROC identifier (ID) or a common POC ID, which greatly reduces the number of comparisons needed between the two data sets in the next step. In the second step, it quickly calculates the similarity between POCs and ROCs using the least-move algorithm. Our approach is a fully automated method for measuring the accuracy of a set of POCs based on KEGG Orthology (KO). In addition, we selected 12 genomes from different domains and used them to compare our similarity measure with a consistency measure, under which a POC is considered consistent if all of its genes belong to the same ROC. We conclude that the auxiliary step that reduces the search space makes it efficient to calculate the similarity between ROCs and POCs, and that our approach provides more robust results than the current standard method based on consistency.
Keywords: Ortholog cluster, accuracy measurement, similarity, consistency, least-move algorithm, orthomeasurer.
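
As an illustration of the two ideas summarized above, the following minimal Python sketch groups POCs and ROCs that are linked by shared genes and then applies the consistency criterion. It is a sketch under stated assumptions, not the implementation used in this study: the function names `group_overlapping` and `consistent_pocs`, the dictionary-of-sets input format, and the toy identifiers are all hypothetical, and a union-find structure is substituted here for the recursive search-and-merge procedure. The least-move similarity calculation is not shown because the abstract does not specify it.

```python
from collections import defaultdict


def group_overlapping(pocs, rocs):
    """Group POC and ROC IDs linked, directly or transitively, by shared genes.

    pocs, rocs: dict mapping a cluster ID to a set of gene IDs (assumed format).
    Returns a list of (set of POC IDs, set of ROC IDs) pairs.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving (stand-in for recursive merging).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Index every gene by the ROCs that contain it.
    gene_to_rocs = defaultdict(list)
    for roc_id, genes in rocs.items():
        for gene in genes:
            gene_to_rocs[gene].append(roc_id)

    # Link each POC to every ROC with which it shares at least one gene.
    for poc_id, genes in pocs.items():
        for gene in genes:
            for roc_id in gene_to_rocs.get(gene, ()):
                union(("POC", poc_id), ("ROC", roc_id))

    # Collect the members of each connected group.
    groups = defaultdict(lambda: (set(), set()))
    for node in list(parent):
        kind, cluster_id = node
        groups[find(node)][0 if kind == "POC" else 1].add(cluster_id)
    return sorted(groups.values(), key=lambda g: sorted(g[0]))


def consistent_pocs(pocs, rocs):
    """Return IDs of POCs whose genes all belong to one and the same ROC."""
    return {
        poc_id
        for poc_id, genes in pocs.items()
        if any(set(genes) <= set(roc_genes) for roc_genes in rocs.values())
    }


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    rocs = {"K1": {"a", "b", "c"}, "K2": {"d", "e"}, "K3": {"f", "g"}}
    pocs = {"P1": {"a", "b"}, "P2": {"c", "d"}, "P3": {"f", "g"}, "P4": {"x"}}
    print(group_overlapping(pocs, rocs))
    print(consistent_pocs(pocs, rocs))
```

In this toy input, P2 shares genes with both K1 and K2, so P1, P2, K1, and K2 fall into one group and only clusters within that group would need to be compared in a subsequent similarity step. P4 shares no gene with any ROC and therefore appears in no group.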