Abstract
Background: Advances in sequencing technology have yielded a huge number of genomes from a multitude of organisms. One of the fundamental tasks in processing and analyzing these sequences is to place them within the existing taxonomy.
Methods: We recently proposed GenFooT, an approach to taxonomy classification based on the concept of a genomic footprint (GFP). This work refines and enhances that technique, improving taxonomic classification accuracy on various benchmark datasets. GenFooT maps a genome sequence into a 2D coordinate space and extracts features from that representation. Computing the features depends on two hyper-parameters: the block size and the number of fragments of the genomic sequence. Here we analyze how to choose the values of these parameters adaptively from the sequences themselves. We call the enhanced version GenFooT2.
Results: We tested GenFooT2 on ten biological datasets of genomic sequences from various organisms belonging to different taxonomic ranks. With a logistic regression classifier, GenFooT2 features improved classification performance by 3% over GenFooT. We also performed a statistical test comparing the performance of GenFooT2 with state-of-the-art methods, including our previous method, GenFooT.
Conclusion: Both the experimental results and the statistical test indicate that the proposed GenFooT2 performs significantly better than the compared methods.
Keywords: Taxonomy classification, mitochondrial genome, genomic footprint, alignment-free, Shannon entropy, supervised classification, biological dataset
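To make the roles of the two hyper-parameters concrete, the following is a minimal illustrative sketch and not the GenFooT or GenFooT2 algorithm itself. It assumes a classic 2D DNA walk as the coordinate mapping and a per-fragment Shannon entropy over square blocks as the feature; the step table `STEPS` and the functions `walk_points`, `block_entropy`, and `footprint_features` are hypothetical names introduced here for illustration only.

```python
import math
from collections import Counter

# Assumed step per nucleotide (a classic 2D DNA walk); not taken from the paper.
STEPS = {"A": (-1, 0), "T": (1, 0), "G": (0, 1), "C": (0, -1)}


def walk_points(seq):
    """Map a sequence to the list of 2D points visited by the walk."""
    x = y = 0
    points = []
    for base in seq.upper():
        dx, dy = STEPS.get(base, (0, 0))  # ambiguous bases (e.g. N) do not move
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def block_entropy(points, block_size):
    """Shannon entropy (bits) of how the points spread over square blocks."""
    if not points:
        return 0.0
    # Each point falls into the block given by integer division of its coordinates.
    counts = Counter((x // block_size, y // block_size) for x, y in points)
    total = len(points)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def footprint_features(seq, n_fragments, block_size):
    """One entropy value per fragment; the last fragment absorbs any remainder."""
    frag_len = max(1, len(seq) // n_fragments)
    features = []
    for i in range(n_fragments):
        start = i * frag_len
        end = len(seq) if i == n_fragments - 1 else start + frag_len
        features.append(block_entropy(walk_points(seq[start:end]), block_size))
    return features
```

In this sketch, a larger `block_size` coarsens the 2D grid and a larger `n_fragments` yields a longer feature vector. The resulting vectors could then be passed to any supervised classifier, such as logistic regression.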