Combinatorial Chemistry & High Throughput Screening

ISSN (Print): 1386-2073
ISSN (Online): 1875-5402

Research Article

Taxonomy Classification using Genomic Footprint of Mitochondrial Sequences

Author(s): Aritra Mahapatra* and Jayanta Mukherjee

Volume 25, Issue 3, 2022

Published on: 10 December, 2021

Page: [401 - 413] Pages: 13

DOI: 10.2174/1386207324666210811102109

Abstract

Background: Advances in sequencing technology have yielded a huge number of genomes from a multitude of organisms. A fundamental task in processing and analyzing these sequences is to organize them within the existing taxonomy.

Methods: We recently proposed GenFooT, a novel approach to taxonomy classification based on the concept of the genomic footprint (GFP). In this work, the technique is refined and enhanced, leading to improved accuracy in taxonomic classification on various benchmark datasets. GenFooT maps a genome sequence into a 2D coordinate space and extracts features from that representation. It uses two hyperparameters when computing the features: the block size and the number of fragments of the genomic sequence. Here, we analyze how to choose the values of these parameters adaptively from the sequences. The enhanced version of GenFooT is named GenFooT2.
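
As a rough illustration of the general idea described above (a 2D walk, block-wise processing, fragment-wise features), the following Python sketch may help. It is not the authors' implementation: the nucleotide-to-step mapping `STEPS`, the function names, and the choice of features (mean block drift and the Shannon entropy of the drifts in each fragment) are all assumptions made for this example. Only the two parameters, `block_size` and `n_fragments`, correspond to the hyperparameters named in the abstract.

```python
import math
from collections import Counter

# Hypothetical nucleotide-to-step mapping (an assumption, not from the paper).
STEPS = {"A": (1, 0), "T": (-1, 0), "G": (0, 1), "C": (0, -1)}


def footprint(seq):
    """Return the cumulative 2D walk traced by a nucleotide sequence."""
    x = y = 0
    points = [(0, 0)]
    for base in seq.upper():
        dx, dy = STEPS.get(base, (0, 0))  # unknown bases (e.g. N) do not move
        x += dx
        y += dy
        points.append((x, y))
    return points


def block_drifts(points, block_size):
    """Displacement of the walk over each consecutive, complete block."""
    drifts = []
    for start in range(0, len(points) - block_size, block_size):
        x0, y0 = points[start]
        x1, y1 = points[start + block_size]
        drifts.append((x1 - x0, y1 - y0))
    return drifts


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def features(seq, block_size, n_fragments):
    """Concatenate (mean dx, mean dy, drift entropy) for each fragment."""
    drifts = block_drifts(footprint(seq), block_size)
    vector = []
    for i in range(n_fragments):
        lo = i * len(drifts) // n_fragments
        hi = (i + 1) * len(drifts) // n_fragments
        frag = drifts[lo:hi]
        if not frag:  # sequence too short for this fragment
            vector.extend([0.0, 0.0, 0.0])
            continue
        mean_dx = sum(d[0] for d in frag) / len(frag)
        mean_dy = sum(d[1] for d in frag) / len(frag)
        vector.extend([mean_dx, mean_dy, shannon_entropy(frag)])
    return vector
```

In this sketch, `features(seq, block_size, n_fragments)` always returns a vector of length `3 * n_fragments`, so sequences of different lengths yield inputs of equal size for a downstream classifier such as logistic regression.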

Results: We tested GenFooT2 on ten biological datasets of genomic sequences from various organisms belonging to different taxonomic ranks. Our experimental results indicate that GenFooT2 features combined with a logistic regression classifier improve classification performance by 3% compared with GenFooT. We also performed a statistical test to compare the performance of GenFooT2 with that of state-of-the-art methods, including our previous method, GenFooT.
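
The abstract does not name the statistical test used. As a sketch of how two methods could be compared across the same set of datasets, the following Python function assumes a paired Wilcoxon signed-rank test. The function name and the handling of zeros (discarded) and ties (average ranks) are choices made for this example, and the p-value relies on a normal approximation that is only a rough guide for a small number of datasets.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Paired Wilcoxon signed-rank test; returns (W, two-sided p-value).

    Zero differences are discarded and tied absolute differences receive
    average ranks. The p-value uses the normal approximation without tie
    or continuity correction.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within tied groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Normal approximation to the null distribution of W.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return w, p
```

Here `a` and `b` would hold the per-dataset scores of the two methods in the same dataset order; a small p-value suggests that the paired differences are not centered on zero.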

Conclusion: The experimental results and the statistical test both show that the proposed GenFooT2 performs significantly better than the compared methods.

Keywords: Taxonomy classification, mitochondrial genome, genomic footprint, alignment-free, Shannon entropy, supervised classification, biological dataset

© 2024 Bentham Science Publishers