PuzzleCluster: A Novel Unsupervised Clustering Algorithm for Binning DNA Fragments in Metagenomics

Kyler      Siegel; Kristen      Altenburger; Yu-Sing      Hon; Jessey      Lin; Chenglong      Yu

doi:10.2174/157489361002150518150716

Abstract

Metagenomic datasets are composed of DNA fragments from large numbers of different and potentially novel organisms. These datasets can contain up to several million sequences taken from heterogeneous populations of extremely varied abundance. Unlike traditional genomic studies, metagenomic analysis requires an additional binning step. This process groups DNA fragments from the same or similar species of origin. However, existing unsupervised metagenomic binning programs cannot accurately analyze datasets containing a large number of species or with significantly unbalanced abundance ratios. To improve upon these current limitations, we present PuzzleCluster, a novel unsupervised binning algorithm. PuzzleCluster incorporates a unique cluster refinement step by automatically grouping reads which share a nucleotide word (i.e. reverse complement pairs) of a predetermined length. Additionally, the clustering parameters are estimated by fitting the Jensen-Shannon distance among sequences using the expectation maximization algorithm. Since clustering parameters are computed based on each dataset, our approach can adapt to the peculiarities of each dataset and is not confined by universal parameters. Furthermore, PuzzleCluster utilizes no prior assumptions about the genetic makeup or number of organisms present in the sample, making it well-suited for applications with a large amount of biodiversity and completely unknown organisms. As a comparison, PuzzleCluster has an accuracy 9%, 19.8%, 15.7%, and 19.5% higher than MetaCluster 3.0 for taxonomic levels phylum, class, order, and family, respectively. PuzzleCluster source code is freely available at http://math.stanford.edu/~ksiegel/PuzzleCluster.html.

Keywords: Clustering, expectation maximization, Jensen-Shannon distance, metagenome, quality threshold algorithm, word agreement.

« Previous Next »

Graphical Abstract

Rights & Permissions Print Cite

Article Metrics

25

6

1

Journal Information

For Authors

For Editors

For Reviewers

Explore Articles

Open Access

Open Access Articles

For Visitors

DOI https://dx.doi.org/10.2174/157489361002150518150716	Print ISSN 1574-8936
Publisher Name Bentham Science Publisher	Online ISSN 2212-392X

Current Bioinformatics

PuzzleCluster: A Novel Unsupervised Clustering Algorithm for Binning DNA Fragments in Metagenomics

Abstract Play Pause

Graphical Abstract

Related Journals

Related Books

Abstract