Optimized Binary Search and Text Retrieval (Preliminary Version)

Gonzalo Navarro, Eduardo Barbosa, Chris Perleberg, Ricardo Baeza-Yates and Nivio Ziviani

We present an algorithm that minimizes the expected cost of indirect binary search for data with non-constant access costs (e.g. disk data). The term "indirect" indicates that sorted access to the data is obtained through an array of pointers to the raw data. One immediate application of this algorithm is to improve the retrieval performance of disk databases that are indexed using the suffix array model (also called PAT array). We consider the cost model of magnetic and optical disks and the anticipated knowledge of the expected size of the subproblem produced by reading each disk track. This information is used to devise a modified binary searching algorithm to reduce overall retrieval costs. Both an optimal and a practical algorithm are presented, together with analytical and some experimental results, showing that standard binary search can be improved by 50% or more.