Binary Searching with Non-uniform Costs
and Its Application to Text Retrieval
Gonzalo Navarro, Ricardo Baeza-Yates, Eduardo Barbosa, Nivio Ziviani
and Walter Cunto
We study the problem of minimizing the expected cost of binary searching
for data where the access cost is not fixed and depends on the last
accessed element, such as data stored in magnetic or optical disk.
We present an optimal algorithm for this problem that finds the
optimal search strategy in O(n^3) time, which is the same time
complexity of the simpler classical problem of fixed costs.
Next, we present two practical linear expected time algorithms,
under the assumption that the access cost of an element is independent of
its physical position.
Both practical algorithms are online, that is, they find the next element
to access as the search proceeds.
The first one is an approximate algorithm which minimizes the
access cost disregarding the goodness of the problem partitioning.
The second one is a heuristic algorithm, whose quality depends on
its ability to estimate the final search cost, and therefore it
can be tuned by recording statistics of previous runs.
We present an application for our algorithms related to text retrieval.
When a text collection is large it demands specialized indexing
techniques for efficient access.
One important type of index is the suffix array, where data access is
provided through an indirect binary search on the text stored in magnetic disk
or optical disk.
Under this cost model we prove that the optimal algorithm cannot perform
better than Omega(1/ log n) times the standard binary search.
We also prove that the approximate strategy cannot, on average, perform worse
than 39% over the optimal one.
We confirm the analytical results with simulations, showing improvements
between 34% (optimal) and 60% (online) over standard binary search for
both magnetic and optical disks.