Abstract
Rapidly developing next-generation sequencing technologies have greatly advanced metagenomics research, yet they also pose serious challenges for the analysis of metagenomic data. A metagenomic sample can contain thousands of microbial species, so a sequencing dataset can contain fragments from thousands of different genomes. Grouping the sequencing reads by their genome of origin, known as binning, is therefore usually performed to expedite further studies. Current binning methods fall into two categories: supervised methods, which require reference genomes, and unsupervised methods, which do not.
We present an unsupervised binning method that combines a novel sequence feature recognition method with a spectral clustering algorithm. The sequence feature is a hybrid of sequence correlation and sequence composition analyses. Experiments on simulated and actual metagenomic datasets suggest that the combination of sequence composition and an intrinsic correlation of oligonucleotides, both extracted from tetranucleotide analyses, performs better than either feature alone. The binning method clusters these features with spectral clustering, a high-performance unsupervised clustering algorithm. The method is available as an open source package called HSS-bin (Hybrid Sequence feature and Spectral clustering unsupervised metagenomic binning) at http://bioinfo.seu.edu.cn/HSS-bin/.
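To make the general idea concrete, the following is a minimal sketch of composition-based spectral binning, not the HSS-bin implementation. It assumes only the tetranucleotide composition feature (the correlation feature is omitted), uses cosine similarity as the affinity measure, and splits reads into just two bins by the sign of the second eigenvector of the random-walk matrix. All function names (`tetra_freq`, `cosine`, `spectral_bipartition`) and the toy reads are hypothetical and chosen for illustration.

```python
from itertools import product
from math import sqrt

# All 256 tetranucleotides in a fixed order, so every read maps to the same axes.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def tetra_freq(read):
    """Return the normalized tetranucleotide frequency vector of one read."""
    counts = [0.0] * len(TETRAMERS)
    read = read.upper()
    for i in range(len(read) - 3):
        j = INDEX.get(read[i:i + 4])  # windows with ambiguous bases are skipped
        if j is not None:
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def cosine(u, v):
    """Cosine similarity; assumed here as the affinity between two reads."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def spectral_bipartition(features, iters=200):
    """Split items into two bins by the sign of the second eigenvector of
    the random-walk matrix D^-1 W, found by deflated power iteration."""
    n = len(features)
    W = [[cosine(features[i], features[j]) for j in range(n)] for i in range(n)]
    d = [sum(row) or 1.0 for row in W]  # guard against all-zero feature vectors
    v = [float(i) for i in range(n)]    # arbitrary deterministic start vector
    for _ in range(iters):
        # Remove the trivial constant eigenvector (degree-weighted projection).
        mean = sum(di * vi for di, vi in zip(d, v)) / sum(d)
        v = [vi - mean for vi in v]
        # One power-iteration step: v <- D^-1 W v.
        v = [sum(W[i][j] * v[j] for j in range(n)) / d[i] for i in range(n)]
        norm = sqrt(sum(vi * vi for vi in v))
        if norm == 0.0:
            break
        v = [vi / norm for vi in v]
    return [0 if vi < 0 else 1 for vi in v]


# Toy input: two AT-rich and two GC-rich reads (made-up sequences).
reads = [
    "ATATTTAAATATTAATTTAAATAT",
    "TTAATATATTTAAATTATATAATT",
    "GCGCCGGCGGCCGCGGGCCGCGCC",
    "CCGGCGCGCCGGGCGCGGCCGCGG",
]
labels = spectral_bipartition([tetra_freq(r) for r in reads])
print(labels)
```

In this toy input the AT-rich and GC-rich reads share no tetranucleotides, so the affinity graph separates cleanly. Real reads overlap in composition, and separating more than two genomes would require several eigenvectors followed by a clustering step such as k-means.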
We evaluated HSS-bin’s performance using both simulated and actual metagenomic datasets. Experimental results indicate that HSS-bin can handle metagenomic sequencing data with non-uniform species abundance, short sequences, and complex phylogenetic diversity with high accuracy. Our method performs well on actual metagenomic datasets and on datasets simulated from a complex metagenomic community.
Keywords: Metagenomics, unsupervised binning, sequence features, spectral clustering.