Abstract
Background: New generation sequencing machinery such as Illumina and Solexa can generate millions of reads from given genome sequence on a single run. There is a need for suitable data structure, efficient with respect to memory as well as time to align these enormous reads into reference genome. There are a number of existing techniques of indexing and reads alignment, such as MAQ, Bowtie, BWA, BWBBLE and Kart. Memory efficient versions of these techniques are 10- 20% slower than their respective normal versions.
Objective: A new approach for efficient indexing and retrieval of large genomic data.
Methods: In this paper, we propose an efficient method based on Burrows Wheeler Transform and Wavelet Tree (BWIT) for genome sequence indexing and reads alignment. Both types of alignments (exact and approximate) are possible by the proposed approach (BWIT).
Results: The performance of BWIT is experimentally found to be better than existing ones with respect to both memory and speed. Experimental work shows that proposed approach performs best in case of protein sequence indexing. All the existing read alignment approaches depend upon the size of index used. In general, time required increases with reduction in index size used. Experiments have been done with Bowtie, BWA & Kart by taking index size as 1.25N, 1.05N, .98N, where N is the size of the text (reference genome). In our approach BWIT index size is .6N which is lesser than index size used in all other approaches. It is observed that even using smallest index size alignment time in our approach is least.
Conclusion: An innovative indexing technique is presented to address the problem of storage, transmission and retrieval of large DNA/Protein Sequence data.
Keywords: Indexing, retrieval, burrows wheeler transform, wavelet tree, genome, indexing, sequencing, alignments.
Graphical Abstract