A Language for Queries on Structure and Contents of Textual Databases

Gonzalo Navarro (Ricardo Baeza-Yates, advisor)

This thesis focuses on the problem of finding a suitable query language for hierarchically structured textual databases.

The problem about current approaches is that there is no consensus on how the structuring model and the query language should be, and that they focus strongly either on expressivity or on efficiency issues, but not on both at the same time. The approaches which are strong in one point are weak in the other. Moreover, there is no formal and complete foundation to analyze the expressivity of these languages.

The goal of this thesis is to find a structural model and a query language that is expressive and efficiently implementable, achieving a good compromise between the two extremes.

In order to achieve this goal, a number of steps have been carried out. In the first place, a comprehensive study and evaluation of previous work on the field has been done. Then, a structuring model and query language with the desired characteristics has been defined. Its expressivity has been compared against similar models, formally and practically. An informal framework to compare the expressivity of similar models has been defined. Then, we focused on implementation. Algorithms have been defined and their worst-case space and time complexity analyzed for all operations, in many different versions for implementing indices and operations. Finally, a real prototype has been developed implementing the proposed model, to evaluate heuristics and draw average running times.

This work leads to the conclusion that a set-oriented query language based on operations on nearby structure elements of one or more hierarchies is quite expressive and efficiently implementable. It also gives an idea of up to where the expressivity can be enriched without degrading the performance. Finally, it suggests some research directions.

This work makes a step in the direction of obtaining a unifying perspective on how a query language for textual databases should be, what expressive power should it have and how well can it be implemented. All this is necessary to put the emerging area of textual databases in the place it deserves in Computer Science.