Abstract
Most efficient read mappers build a Ferragina-Manzini index of a genome sequence and then process reads against it. In order to handle differences between reads and corresponding genome fragments, approximate read occurrences are searched in the index. This technique is particularly efficient for mapping reads of length ∼30bp with up to 2-3 errors, as first massive sequencers required.
However, within the last few years, in most popular sequencing technologies read length increased to 75 − 200bp. Since the number of required index queries is exponential with respect to the number of errors, it is hard to maintain the allowed error rate within this method.
We propose a new approach that overcomes this problem. The main idea is to use the Ferragina-Manzini index to filter potential approximate read occurrences. Filtering is based on the intermediate partitioning concept, i.e. reads are split into parts, which are searched in index with reduced number of errors.
We implemented this method in Bmap program. Our experiments show that Bmap outperforms current methods in efficiency without sacrificing mapping accuracy.
Keywords: Approximate string matching, ferragina-manzini index, intermediate partitioning, next generation sequencing, read mapping, succinct suffix array.
Graphical Abstract