Parallel Generation of Inverted Lists for Distributed Text Collections
Berthier Ribeiro, João Paulo Kitajima, Gonzalo Navarro,
Cláudio Sant'Ana and Nivio Ziviani
We present a scalable algorithm for the parallel computation of inverted
files for large text colections. The algorithm takes into account an
environment of a high bandwidth network of workstations with a
shared-nothing memory organization. The text collection is assumed to be
evenly distributed among the disks of the various workstations. Compression
is used to save space in main memory (where inverted lists are kept) and
to save time when data have to be moved across the network. The algorithm
average running cost is O(t/p) where t is the size of the
whole text collection and p is the number of available processors.
We implemented our algorithm and drew experimental results. In a 100
Mbits/s switched Ethernet network with 4 PentiumPro 200 megahertz, 128
megabytes RAM on each processor, we were able to invert 2 gigabytes of
TREC documents in 15 minutes. Further, we also proposed an analytical model
for the algorithm execution time.