Current Genomics

ISSN (Print): 1389-2029
ISSN (Online): 1875-5488

Research Article

A Deep Clustering-based Novel Approach for Binning of Metagenomics Data

Author(s): Sharanbasappa D. Madival, Dwijesh Chandra Mishra*, Anu Sharma, Sanjeev Kumar, Arpan Kumar Maji, Neeraj Budhlakoti, Dipro Sinha and Anil Rai

Volume 23, Issue 5, 2022

Published on: 17 October, 2022

Pages: 353-368 (16 pages)

DOI: 10.2174/1389202923666220928150100

Abstract

Background: One major challenge in binning metagenomics data is the limited availability of reference datasets, as only 1% of the total microbial population has been cultured so far. This has increased the importance of unsupervised methods, which can bin sequences in the absence of any reference dataset.

Objective: To develop a deep clustering-based binning approach for Metagenomics data and to evaluate results with suitable measures.

Methods: In this study, a deep learning-based approach is taken for binning metagenomics data. The results are validated on different datasets using features such as tetra-nucleotide frequency (TNF), hexa-nucleotide frequency (HNF) and GC content. A convolutional autoencoder is used for feature extraction, and the K-means clustering method is used for binning.
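The composition features named above can be illustrated with a short sketch. This is not the authors' code: the function names `gc_content` and `kmer_frequencies`, the choice to skip windows containing ambiguous bases, and the normalisation to relative frequencies are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    counts = Counter(seq.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        return 0.0
    return (counts["G"] + counts["C"]) / total


def kmer_frequencies(seq: str, k: int = 4) -> list:
    """Relative frequencies of all 4**k k-mers, in lexicographic order.

    k=4 gives tetra-nucleotide frequencies (256 values) and k=6 gives
    hexa-nucleotide frequencies (4096 values). Windows containing any
    character other than A, C, G or T are skipped (an assumption).
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:
            counts[position] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]


if __name__ == "__main__":
    contig = "ACGTACGTGGCCNNACGT"  # hypothetical toy contig
    tnf = kmer_frequencies(contig, k=4)
    print(len(tnf), round(gc_content(contig), 3))
```

In a pipeline of the kind described, vectors like these would be computed per contig and passed to the autoencoder; how the paper scales or combines them is not stated in the abstract.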

Results: In most cases, the Silhouette index exceeds 0.5 and the Rand index exceeds 0.8, which indicates that the proposed approach gives satisfactory results. The performance of the developed approach is compared with current methods and tools using benchmarked low-complexity simulated and real metagenomic datasets. It is found to perform better than unsupervised methods and on par with semi-supervised methods.
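For readers unfamiliar with the two evaluation measures, the following minimal sketch computes them from their standard definitions. It assumes Euclidean distance for the Silhouette index and the unadjusted Rand index; the abstract does not state which variants or which software the authors used, and the function names are illustrative.

```python
from itertools import combinations
from math import dist


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two labelings agree.

    A pair agrees if both labelings put it in the same cluster, or both
    put it in different clusters. Requires at least two points.
    """
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def silhouette_index(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated groups in 2-D.
    points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    predicted = [0, 0, 1, 1]
    reference = ["a", "a", "b", "b"]
    print(silhouette_index(points, predicted), rand_index(reference, predicted))
```

The Silhouette index needs only the data and the predicted bins, whereas the Rand index needs reference labels, so the latter applies to simulated or otherwise labelled benchmark datasets.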

Conclusion: An unsupervised deep learning-based approach for binning has been proposed, and the developed method shows promising results on various datasets. This is a novel approach for addressing the lack of reference data in metagenomic binning.

Keywords: Binning, Convolutional Autoencoder, Deep clustering, Metagenomics, Genomic features, K-means

© 2024 Bentham Science Publishers