Repetitive sequences

On this page we provide datasets of sequences with long runs. Each sequence was generated by computing the Burrows-Wheeler Transform (BWT) of a repetitive text.

Wiki (538 MB)

Download dataset

BWT of the edit history of some Wikipedia pages (see pages), using words as symbols. The Wikipedia Extractor was used to transform the XML file of the pages into a simpler text file. Then, this script was used to convert words into a contiguous integer alphabet. The final BWT can be obtained with any suffix array algorithm for integer alphabets. This dataset has 140,990,835 symbols, an alphabet size of 174,796 and 2,586,752 runs. Each symbol is encoded using 4 bytes.
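The steps above (words to integers, then BWT via a suffix array) can be sketched as follows. This is a minimal illustration, not the script linked above: the function names `words_to_ids` and `bwt_from_suffix_array`, the assignment of identifiers in sorted word order, and the sentinel value 0 are all assumptions, and the naive suffix sort is only practical for tiny inputs.

```python
def words_to_ids(words):
    """Map each distinct word to an integer in 1..sigma (assumed: sorted order)."""
    alphabet = {w: i + 1 for i, w in enumerate(sorted(set(words)))}
    return [alphabet[w] for w in words], alphabet


def bwt_from_suffix_array(seq, sentinel=0):
    """Return the BWT of seq followed by a sentinel.

    The sentinel is assumed to be smaller than every symbol in seq.
    """
    s = list(seq) + [sentinel]
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT symbol for a suffix is the symbol just before it
    # (s[-1], the sentinel, when the suffix starts at position 0).
    return [s[i - 1] for i in sa]


ids, alphabet = words_to_ids("a b a b a b".split())
print(ids)                         # [1, 2, 1, 2, 1, 2]
print(bwt_from_suffix_array(ids))  # [2, 2, 2, 0, 1, 1, 1]
```

For inputs of the size listed here, the sorting line would have to be replaced by a linear-time or otherwise scalable suffix array construction for integer alphabets.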

Wiki (319 MB)

Download dataset

BWT of the edit history of some Wikipedia pages (see pages), using words as symbols. The Wikipedia Extractor was used to transform the XML file of the pages into a simpler text file. Then, this script was used to convert words into a contiguous integer alphabet. The final BWT can be obtained with any suffix array algorithm for integer alphabets. This dataset has 83,374,477 symbols, an alphabet size of 188,932 and 2,694,892 runs. Each symbol is encoded using 4 bytes.

World leaders 1B (45 MB)

Download dataset

BWT of the repetitive dataset World leaders of the Pizza&Chili corpus. This dataset has 46,968,182 symbols, an alphabet size of 90 and 573,487 runs. Each symbol is encoded using 1 byte.

World leaders 2B (90 MB)

Download dataset

BWT of the repetitive dataset World leaders of the Pizza&Chili corpus. For this dataset, each BWT entry is formed from the two symbols preceding the suffix, instead of only the single previous symbol. This dataset has 46,968,182 symbols, an alphabet size of 2,528 and 875,406 runs. Each symbol is encoded using 2 bytes.
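The two-symbol variant described above can be sketched as follows. The page does not say how the pair of symbols is packed into 2 bytes or how the start of the text is handled, so the function name `bwt_two_previous`, the packing `(second_previous << 8) | previous`, the cyclic wrap-around and the sentinel byte 0 are all assumptions made for illustration.

```python
def bwt_two_previous(text, sentinel=0):
    """BWT variant that emits the two symbols preceding each sorted suffix.

    Assumes the sentinel byte does not occur in text and is smaller than
    every byte in it. Each output value packs the pair into 16 bits.
    """
    s = bytes(text) + bytes([sentinel])
    n = len(s)
    # Naive suffix array, fine for a small example only.
    sa = sorted(range(n), key=lambda i: s[i:])
    # High byte: symbol two positions back; low byte: previous symbol.
    return [(s[(i - 2) % n] << 8) | s[(i - 1) % n] for i in sa]


pairs = bwt_two_previous(b"abab")
print(pairs)                     # [24930, 24930, 25088, 25185, 97]
print([p & 0xFF for p in pairs]) # [98, 98, 0, 97, 97], the ordinary BWT
```

Under this packing, the low byte of each entry is the ordinary one-symbol BWT, so the two-symbol sequence can only have the same number of runs or more, which is consistent with the larger run count reported for the 2B datasets.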

Kernel 1B (247 MB)

Download dataset

BWT of the repetitive dataset Kernel of the Pizza&Chili corpus. This dataset has 257,961,617 symbols, an alphabet size of 161 and 2,791,368 runs. Each symbol is encoded using 1 byte.

Kernel 2B (493 MB)

Download dataset

BWT of the repetitive dataset Kernel of the Pizza&Chili corpus. For this dataset, each BWT entry is formed from the two symbols preceding the suffix, instead of only the single previous symbol. This dataset has 257,961,617 symbols, an alphabet size of 7,124 and 4,194,799 runs. Each symbol is encoded using 2 bytes.
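The statistics quoted for each dataset (number of symbols, alphabet size, number of runs) can be recomputed from a downloaded file with a short script such as the one below. It is a sketch under stated assumptions: the files are taken to be raw, headerless arrays of unsigned integers in little-endian byte order, which this page does not specify, and the names `read_symbols` and `count_runs` are hypothetical.

```python
import sys
from array import array


def read_symbols(path, width):
    """Read a file of fixed-width unsigned symbols (width = 1, 2 or 4 bytes).

    Assumes a headerless file in little-endian byte order.
    """
    typecode = {1: "B", 2: "H", 4: "I"}[width]
    syms = array(typecode)
    if syms.itemsize != width:
        raise RuntimeError("unexpected integer size on this platform")
    with open(path, "rb") as f:
        syms.frombytes(f.read())
    if sys.byteorder != "little":
        syms.byteswap()
    return syms


def count_runs(syms):
    """Count maximal runs of equal consecutive symbols."""
    runs = 0
    prev = None
    for s in syms:
        if s != prev:
            runs += 1
            prev = s
    return runs


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python stats.py <file> <bytes per symbol>
    data = read_symbols(sys.argv[1], int(sys.argv[2]))
    print("symbols:", len(data))
    print("alphabet size:", len(set(data)))
    print("runs:", count_runs(data))
```

If the numbers printed for a file do not match the ones listed on this page, the byte-order or header assumption is the first thing to check.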