
What follows is the original user manual by Kunihiko Sadakane; my only change
is the "reportLevel" comment. Inside the distribution you will also find
csa0507a.tar.gz, which is a newer version by the author. I ran my tests
against the main version included here, but the newer version should be
faster for counting queries. --- Gonzalo Navarro

Compressed Suffix Array

Kunihiko Sadakane (sada@dais.is.tohoku.ac.jp)

%mkarray file
  Create the suffix array "file.sa" from "file".

%mkcsa file [D L]
  Create the compressed suffix array files "file.psi" and "file.idx"
    from "file" and "file.sa".
  D is the interval between suffix array indices whose values are stored
  explicitly. That is, SA[i*D] is stored explicitly.
  L is the interval between indices of the psi function whose values are
  stored explicitly. That is, Psi[i*L] is stored explicitly.
  Default values are D=16 and L=128.
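
  As a rough illustration of the space trade-off controlled by D and L, the
  following C sketch counts how many values are stored explicitly for a text
  of length n. The function names are hypothetical and not part of the
  distribution. The index ranges (i=0,...,n/D and j=1,...,n/L) and the 32-bit
  value size are taken from the "Data structures" part of this manual; the
  pointers into the bit stream and the encoded psi values are not counted.

```c
#include <stdint.h>

/* Hypothetical helper: number of suffix array entries SA[i*D] stored
   explicitly, for i = 0, 1, ..., n/D (integer division). */
uint32_t sa_sample_count(uint32_t n, uint32_t D)
{
    return n / D + 1;
}

/* Hypothetical helper: number of block head values Psi[j*L] stored
   explicitly, for j = 1, 2, ..., n/L (integer division). */
uint32_t psi_head_count(uint32_t n, uint32_t L)
{
    return n / L;
}

/* Bytes taken by the explicit samples alone, assuming one 32-bit
   integer per sample. This is a lower bound on the size of "file.idx",
   not its exact size. */
uint64_t explicit_sample_bytes(uint32_t n, uint32_t D, uint32_t L)
{
    return 4u * ((uint64_t)sa_sample_count(n, D) + psi_head_count(n, L));
}
```

  Under these assumptions, a text of n=1000 characters with the default
  values gives 63 suffix array samples and 7 block heads. Smaller D and L
  mean more explicit values and therefore a larger "file.idx".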

%csa file
  Search keywords using the compressed suffix array "file.psi" and "file.idx".
   (Change the value of the reportLevel variable at the beginning of
    suftest4.c to get lower reporting levels.)

%chendian file
  Change the endianness (byte order) of "file.psi" and "file.idx".
  Output files are "file_.psi" and "file_.idx".
  Note: SPARC is big endian, while Intel x86 is little endian.
        No information on the endianness is stored in "file.idx".
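
  Changing the byte order amounts to reversing the bytes of every stored
  word. The following is a minimal C sketch of that idea, assuming the word
  sizes given under "Data structures" (32-bit integers in "file.idx", 16-bit
  words in "file.psi"). The function names are hypothetical, and this is not
  the actual code of chendian.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the two bytes of a 16-bit word. */
uint16_t swap16(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

/* Reverse the four bytes of a 32-bit word. */
uint32_t swap32(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}

/* Convert an in-memory array of 16-bit words in place. */
void swap16_array(uint16_t *a, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        a[i] = swap16(a[i]);
}

/* Convert an in-memory array of 32-bit words in place. */
void swap32_array(uint32_t *a, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        a[i] = swap32(a[i]);
}
```

  Because the files carry no endianness marker, a program cannot detect on
  its own whether a conversion is needed; the user has to know on which kind
  of machine the index was built.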

----------------------------------------------
Data structures

Data structures are stored in two files: "file.psi" and "file.idx".
"file.psi" stores only the psi function encoded by gamma code,
and "file.idx" stores other information such as the length of the text,
pointers to the psi function, suffix pointers stored explicitly, etc.
All values in "file.idx" are stored as 32-bit integers, while the bit stream
in "file.psi" is stored as 16-bit words.

The text has length n and it is represented by T[1..n].
A unique terminator $ is implicitly appended as the (n+1)-th character T[n+1].
We assume that the terminator is alphabetically smaller than any other
character.

The suffix array of the text is represented by SA[0..n].
Note that SA[0] is always n+1 because the (n+1)-th suffix T[n+1..n+1]=$
is the smallest suffix in lexicographic order.
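
The conventions above can be checked on a small text with the following C
sketch. It is a naive construction by sorting, written only for illustration;
it is not the algorithm used by mkarray, and the names build_suffix_array,
cmp_suffix and g_text are hypothetical. The NUL byte that ends a C string
plays the role of the terminator $, so the text itself must not contain NUL
bytes.

```c
#include <stdlib.h>
#include <string.h>

static const char *g_text;  /* text being indexed, read by the comparator */

/* Compare two suffixes given by their 1-based starting positions.
   strcmp compares bytes as unsigned char, and the empty suffix at
   position n+1 (the terminator) sorts before every other suffix. */
static int cmp_suffix(const void *a, const void *b)
{
    int i = *(const int *)a;
    int j = *(const int *)b;
    return strcmp(g_text + (i - 1), g_text + (j - 1));
}

/* Fill sa[0..n] with the 1-based starting positions of all suffixes of
   text (length n), in lexicographic order. Position n+1 denotes the
   suffix that consists of the terminator only. */
void build_suffix_array(const char *text, int *sa)
{
    int n = (int)strlen(text);
    int i;

    for (i = 0; i <= n; i++)
        sa[i] = i + 1;
    g_text = text;
    qsort(sa, (size_t)n + 1, sizeof sa[0], cmp_suffix);
}
```

For the text "banana" (n=6) this yields SA[0..6] = 7, 6, 4, 2, 1, 5, 3, so
SA[0] is n+1=7 as stated.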

Values SA[i*D] (i=0,1,...,n/D) are stored explicitly in "file.idx".
The psi function Psi[i] is defined as Psi[i]=SA^{-1}[SA[i]+1] for i=1,2,...,n.
It is divided into blocks, each of which stores Psi[j*L..(j+1)*L-1].
The head value of each block, Psi[j*L] (j=1,2,...,n/L), is stored explicitly
in "file.idx"; the other values are converted to the difference from the head
value and encoded by gamma code in "file.psi".
The head values and pointers to the bit stream in "file.psi" are stored in
the array R in "file.idx".
If Psi[i]<Psi[i-1], Psi[i] is encoded as two numbers: (n+65536)-Psi[i-1]
and Psi[i]+1. The magic number 65536 comes from the use of decoding tables
for gamma code, which are 16 bits wide. The "+1" term is needed because
gamma code represents only numbers greater than zero, while the psi value
is greater than or equal to zero.
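
The definition of the psi function and the gamma code can be illustrated
with the following C sketch. It assumes that "gamma code" means the standard
Elias gamma code (k zero bits followed by the k+1 bits of the number in
binary); the manual does not spell this out. The function names are
hypothetical, bits are written as '0'/'1' characters for readability, and
neither the block layout nor the two-number escape described above is
implemented.

```c
#include <stdint.h>
#include <stdlib.h>

/* Compute psi[i] = SA^{-1}[SA[i]+1] for i = 1..n from sa[0..n], whose
   values are 1-based text positions with sa[0] = n+1. psi[0] is left
   untouched because the manual defines psi only for i >= 1.
   Returns 0 on success, -1 if memory allocation fails. */
int compute_psi(const int *sa, int n, int *psi)
{
    int i;
    int *isa = (int *)malloc(((size_t)n + 2) * sizeof *isa); /* isa[1..n+1] */

    if (isa == NULL)
        return -1;
    for (i = 0; i <= n; i++)
        isa[sa[i]] = i;              /* inverse suffix array */
    for (i = 1; i <= n; i++)
        psi[i] = isa[sa[i] + 1];     /* rank of the next suffix in the text */
    free(isa);
    return 0;
}

/* Write the Elias gamma code of x (x >= 1) into out as a NUL-terminated
   string of '0' and '1'. out must hold at least 64 bytes.
   Returns the number of bits written. */
int gamma_encode(uint32_t x, char *out)
{
    int nbits = 0, len = 0, b;
    uint32_t t;

    for (t = x; t > 0; t >>= 1)
        nbits++;                     /* length of x in binary */
    for (b = 0; b < nbits - 1; b++)
        out[len++] = '0';            /* unary prefix: nbits-1 zeros */
    for (b = nbits - 1; b >= 0; b--)
        out[len++] = ((x >> b) & 1u) ? '1' : '0';
    out[len] = '\0';
    return len;
}

/* Decode one gamma code from a well-formed '0'/'1' string. If used is
   not NULL, the number of bits consumed is stored there. */
uint32_t gamma_decode(const char *in, int *used)
{
    int zeros = 0, pos = 0, b;
    uint32_t x = 0;

    while (in[pos] == '0') {
        zeros++;
        pos++;
    }
    for (b = 0; b <= zeros; b++)
        x = (x << 1) | (uint32_t)(in[pos++] == '1');
    if (used != NULL)
        *used = pos;
    return x;
}
```

For the text "banana", SA[0..6] = 7, 6, 4, 2, 1, 5, 3 gives
Psi[1..6] = 0, 5, 6, 3, 1, 2. The value Psi[1]=0 shows why a psi value cannot
be passed to the gamma coder directly: gamma_encode needs an argument of at
least 1, for example gamma_encode(5, buf) produces "00101".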

