A Practical Index for Searching the Human Genome
Heikki Hyyrö and Gonzalo Navarro
Several proposals exist for indexed approximate string matching. It is known
that these approaches are, in principle, well suited to computational biology
tasks, but in most cases no effort has been made to apply them. In this paper
we choose some of the principles behind these previous indices, and tune them
to suit the real case of searching the human genome. We also pay attention to
issues such as secondary memory and uneven distribution, which are disregarded
in many proposals, but have significance when indexing the human genome.
Unique features of our index are: optimized selection of pattern pieces,
bidirectional text verification, and optimized piece neighborhood generation.
Empirically, we find our index to be 2 to 10 times faster than the best
previous approach when the index fits in memory.
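
To make the general idea of indexed approximate matching by pattern pieces concrete, the following is a minimal sketch of the classical filtration scheme. It is not the index described in this paper: it splits the pattern into k+1 disjoint pieces of fixed length q, looks each piece up exactly in a simple q-gram table, and verifies a text window around every hit with a standard dynamic-programming edit distance. The optimized piece selection, bidirectional verification, and piece neighborhood generation mentioned above are not reproduced here, and all function names and parameters (`build_qgram_index`, `best_semi_global_distance`, `search`, `q`, `k`) are hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every q-gram of text to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def best_semi_global_distance(pattern, window):
    """Smallest edit distance between pattern and any substring of window."""
    prev = [0] * (len(window) + 1)  # a match may start anywhere in window
    for i, pc in enumerate(pattern, start=1):
        curr = [i]  # pattern prefix of length i against an empty substring
        for j, wc in enumerate(window, start=1):
            curr.append(min(prev[j - 1] + (pc != wc),  # match or substitution
                            prev[j] + 1,               # pattern char unmatched
                            curr[j - 1] + 1))          # extra text char
        prev = curr
    return min(prev)  # a match may end anywhere in window


def search(text, index, q, pattern, k):
    """Return sorted (start, end) text windows holding a match with <= k errors.

    Assumes len(pattern) >= (k + 1) * q. With k + 1 disjoint pieces and at
    most k errors, at least one piece must occur in the text unaltered.
    """
    m = len(pattern)
    if m < (k + 1) * q:
        raise ValueError("pattern too short for k + 1 pieces of length q")
    verified = set()
    for piece_no in range(k + 1):
        off = piece_no * q
        piece = pattern[off:off + q]
        for pos in index.get(piece, []):
            # Window wide enough for the whole pattern plus up to k indels.
            start = max(0, pos - off - k)
            end = min(len(text), pos - off + m + k)
            if (start, end) in verified:
                continue
            if best_semi_global_distance(pattern, text[start:end]) <= k:
                verified.add((start, end))
    return sorted(verified)


if __name__ == "__main__":
    text = "TTTTACGTAGCATTTT"
    index = build_qgram_index(text, 3)
    # The pattern differs from text[4:12] by one substitution.
    print(search(text, index, 3, "ACGTTGCA", 1))  # prints [(3, 13)]
```

In this sketch the pieces are simply the first k+1 consecutive q-grams of the pattern. Choosing which pieces to search, and how many errors to allow in each, is the kind of decision a tuned index would optimize.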