Information Retrieval: Data Structures & Algorithms

Preface

Text is the primary way that human knowledge is stored, and after speech, the primary way it is transmitted. Techniques for storing and searching for textual documents are nearly as old as written language itself. Computing, however, has changed the ways text is stored, searched, and retrieved. In traditional library indexing, for example, documents could only be accessed by a small number of index terms such as title, author, and a few subject headings. With automated systems the number of indexing terms that can be used for an item is virtually limitless.

The subfield of computer science that deals with the automated storage and retrieval of documents is called information retrieval (IR). Automated IR systems were originally developed to help manage the huge scientific literature that has developed since the 1940's, and this is still the commonest use of IR systems. IR systems are in widespread use in university, corporate, and public libraries. IR techniques have also been found useful, however, in such disparate areas as office automation and software engineering. Indeed any field that relies on documents to do its work could potentially benefit from IR techniques.

IR shares concerns with many other computing sub-disciplines, such as artificial intelligence, multi-media systems, parallel computing, human factors, etc. Yet, in our observation, IR is not widely known in the computer science community. It is often confused with DBMS \(em a field with which it shares concerns and yet from which it is distinct. We hope that this book will make IR techniques more widely known and used.

Data structures and algorithms are fundamental to computer science. Yet, despite a large IR literature, the basic data structures and algorithms of IR have never been collected in a book. This is the need that we are attempting to fill. In discussing IR data structures and algorithms we attempt to be evaluative as well as descriptive. We discuss relevant empirical studies that have compared the algorithms and data structures, and some of the most important algorithms are presented in detail, including implementations in C.

Our primary audience is software engineers building systems with text processing components. Students of computer science, information science, library science, and other disciplines who are interested in text retrieval technology should also find the book useful. Finally, we hope that information retrieval researchers will use the book as a basis for future research.

Bill Frakes

Ricardo Baeza-Yates

Acknowledgements

Many people improved this book with their reviews. The authors of the chapters did considerable reviewing of each others' work. Other reviewers include Jim Kirby, Jim O'Connor, Fred Hills, Gloria Hasslacher, and Ruben Prieto-Diaz. All of them have our thanks. Special thanks to Chris Fox, who tested The Code of the book; to Steve Wartik for his patient unravelling of many Latex puzzlez; and to Donna Harman for their helpful suggestions.