Abstract
<div class="line" id="line-5"> In this paper, we present a nearly tight analysis of the encoding length of the Burrows-Wheeler Transform (BWT) that is motivated by the text indexing setting. For a text <i> T </i> of <i> n </i> symbols drawn from an alphabet Σ, our encoding scheme achieves bounds in terms of the <i> h </i> th-order empirical entropy <i> Hh </i> of the text, and takes linear time for encoding and decoding. We also describe a lower bound on the encoding length of the BWT that constructs an infinite (non-trivial) class of texts that are among the hardest to compress using the BWT. We then show that our upper bound encoding length is <i> nearly tight </i> with this lower bound for the class of texts we described.</div><div class="line" id="line-108"> In designing our BWT encoding and its lower bound, we also address the <i> t-subset problem </i> ; here, the goal is to store a subset of <i> t </i> items drawn from a universe [1‥ <i> n </i> ] using just lg( <i> nt </i> ) + <i> O </i> (1) bits of space. A number of solutions to this basic problem are known, however encoding or decoding usually requires either <i> O </i> ( <i> t </i> ) operations on large integers [Knu05, Rus05] or <i> O </i> ( <i> n </i> ) operations. We provide a novel approach to reduce the encoding/decoding time to just <i> O </i> ( <i> t </i> ) operations on <i> small </i> integers (of size <i> O </i> (lg <i> n </i> ) bits), without increasing the space required.</div><div class="line" id="line-363"> <br/></div><div class="line" id="line-366"> <br/></div><div class="line" id="line-380"> <br/></div>
| Original language | American English |
|---|---|
| State | Published - 2008 |
Disciplines
- Computer Sciences
Cite this
- APA
- Standard
- Harvard
- Vancouver
- Author
- BIBTEX
- RIS