A Practical Index for Searching the Human Genome

Heikki Hyyrö and Gonzalo Navarro

Several proposals exist for indexed approximate string matching. These approaches are known to be, in principle, well suited to computational biology tasks, but in most cases no effort has been made to apply them there. In this paper we select some of the principles behind these previous indices and tune them to the real case of searching the human genome. We also pay attention to issues such as secondary memory and uneven distribution, which many proposals disregard but which are significant when indexing the human genome. The unique features of our index are: optimized selection of pattern pieces, bidirectional text verification, and optimized piece neighborhood generation. We find empirically that our index is 2 to 10 times faster than the best previous approach when the index fits in memory.
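
To make the notions of "pattern pieces" and "text verification" concrete, the following minimal Python sketch shows the classical pigeonhole filter on which many partitioning-based indices build: a pattern that matches with at most k errors, once split into k+1 pieces, must contain at least one piece that occurs exactly. This is an illustration under stated assumptions, not the index described in this paper: the function names (`split_pattern`, `sellers_ends`, `filtered_search`) are hypothetical, the pieces are simply of near-equal length rather than optimized, a plain `str.find` scan stands in for the index lookup, verification runs in one direction only, and no piece neighborhoods are generated.

```python
def split_pattern(pattern, k):
    """Split pattern into k+1 contiguous pieces of near-equal length.

    Returns (offset, piece) pairs. Equal-length splitting is an assumption
    of this sketch; it is not the optimized selection the paper refers to.
    """
    base, extra = divmod(len(pattern), k + 1)
    pieces, start = [], 0
    for i in range(k + 1):
        length = base + (1 if i < extra else 0)
        pieces.append((start, pattern[start:start + length]))
        start += length
    return pieces


def sellers_ends(pattern, text, k):
    """End positions (inclusive) in text where pattern matches with <= k edits.

    Standard dynamic programming for approximate searching: row 0 is kept
    at zero so that a match may start at any text position.
    """
    m = len(pattern)
    col = list(range(m + 1))          # column for the empty text prefix
    ends = set()
    for j, c in enumerate(text):
        prev_diag = col[0]            # value of D[i-1][j-1]
        col[0] = 0
        for i in range(1, m + 1):
            old = col[i]              # value of D[i][j-1]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost, old + 1, col[i - 1] + 1)
            prev_diag = old
        if col[m] <= k:
            ends.add(j)
    return ends


def filtered_search(pattern, text, k):
    """Pigeonhole filter: locate exact piece occurrences, then verify nearby text."""
    m = len(pattern)
    if not 0 <= k < m:
        raise ValueError("need 0 <= k < len(pattern) so that every piece is non-empty")
    ends = set()
    for offset, piece in split_pattern(pattern, k):
        pos = text.find(piece)        # stand-in for an index lookup
        while pos != -1:
            # Any occurrence containing this piece lies inside this window.
            lo = max(0, pos - offset - k)
            hi = min(len(text), pos - offset + m + k)
            ends.update(lo + e for e in sellers_ends(pattern, text[lo:hi], k))
            pos = text.find(piece, pos + 1)
    return ends
```

Calling `filtered_search("GATTACA", text, 1)` on a DNA string `text` returns the same set of end positions as running `sellers_ends` over the whole text, while only running the dynamic programming on windows around exact piece occurrences.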