An Effective Clustering Algorithm to Index High Dimensional Metric Spaces

Edgar Chávez and Gonzalo Navarro

A metric space consists of a collection of objects and a distance function defined among them, which satisfies the triangular inequality. The goal is to preprocess the set so that, given a set of objects and a query, retrieve those objects close enough to the query. The number of distances computed to achieve this goal is the complexity measure. The problem is very difficult in the so-called high-dimensional metric spaces, where the histogram of distances has a large mean and a small variance. A recent survey on methods to index metric spaces has shown that the so-called clustering algorithms are better suited than their competitors, pivot-based algorithms, to cope with high-dimensional metric spaces. In this paper we present a new clustering method that achieves much better performance than all the existing data structures. We present analytical and experimental results that support our claims and that give the users the tuning parameters to make optimal use of this data structure.