Burrows Wheeler Transform for Compression

How does the BWT work, and why is it useful for compression?

2026-04-04

programming

The Burrows-Wheeler Transform is a “compression booster” that can be used alongside other techniques to improve the predictability of a given text, thus resulting in a more effective overall compression pipeline.

Consider the following text as an example.

supercalifragilisticexpialidocious

Apply an encoding like Huffman to the input to get variable-length codes for each symbol.

Input: supercalifragilisticexpialidocious

Symbol  Freq    Huffman Code
----------------------------
i       7       00
a       3       010
c       3       1111
e       2       1110
l       3       1100
o       2       1000
p       2       1011
r       2       1010
s       3       1001
u       2       11011
x       1       01110
d       1       01111
f       1       01100
g       1       01101
t       1       11010

Notice that frequent symbols are encoded with fewer bits. Now the input string can be represented using 126 bits instead of 272, or about 54% less than if each symbol were to be represented using 8 bits. There’s a bit of additional space overhead that comes along with this encoding, but that’s beside the point here.
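As a sanity check, a minimal Huffman construction reproduces the 126-bit total. This is a sketch (the function names here are made up for illustration); the exact codewords can differ from the table above depending on tie-breaking, but the total encoded length is the same for any valid Huffman code:

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Build a Huffman tree bottom-up and return each symbol's code length."""
    freq = Counter(text)
    # Heap entries are (weight, tiebreak, {symbol: code_length}).
    # The tiebreak counter keeps heapq from ever comparing the dicts.
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2], freq

text = "supercalifragilisticexpialidocious"
lengths, freq = huffman_code_lengths(text)
total_bits = sum(freq[s] * lengths[s] for s in freq)
print(total_bits)  # 126, matching the table above
```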

Leveraging the overall frequency of symbols is not the only way to achieve compression. The Huffman Coding above is an example of Zero-Order Entropy compression, since it does not take into account each symbol’s “context”.

When compressing a text, there’s often contextual information that can be leveraged to improve results. For example, consider the fact that certain letters or words tend to follow others within English text, or that highly repetitive sequences of symbols can be taken advantage of, rather than the frequency of individual symbols alone.

The Burrows-Wheeler Transform (BWT) can be used to leverage this idea of context in order to achieve better compression. It’s used alongside other techniques, such as Run-Length Encoding (RLE) and the Move-to-Front Transform (MTF).

For example, here’s an overly simplistic, poor man’s compression pipeline that mimics what bzip2 does:

cat sample.txt |\
  standardize.py |\       # eg. 'hi there' -> 104 105 32 116 104 101 114 101
  rle.py |\
  bwt_suffix_array.py |\
  mtf.py |\
  rle.py |\
  huffman.py |\
  standardize.py -d |\
  wc -c                   # eg. generally fewer bytes than with just huffman.py alone
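The contents of the scripts above aren’t shown here. As an illustration only, a move-to-front transform along the lines of what an `mtf.py` might do could be sketched like this:

```python
def mtf_encode(data, alphabet=None):
    """Move-to-Front: emit each symbol's current position in a table,
    then move that symbol to the front of the table. Runs of equal
    symbols (which the BWT produces) become runs of zeros, which the
    downstream RLE and Huffman stages compress well."""
    table = list(alphabet or sorted(set(data)))
    out = []
    for sym in data:
        i = table.index(sym)
        out.append(i)
        table.insert(0, table.pop(i))  # move the symbol to the front
    return out

print(mtf_encode("aaabbb"))  # [0, 0, 0, 1, 0, 0]
```

Note how the run "aaa" collapses into zeros immediately, and "bbb" costs a single non-zero index before collapsing too.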

How does the BWT work?

The BWT generates every cyclic rotation of the original input text, then sorts all of the rotations. Conceptually, this produces a matrix where each row is a rotation of the original text.

To apply the BWT to a text of length n, start by appending an end-of-string sentinel symbol that is lexicographically smaller than any other symbol in the text. Then rotate the symbols from the back to the front n times (front to back is also fine, since the next step is to sort the rotations). After sorting, the BWT is the last column of the matrix.

         example
APPEND   │
         ▼
         example$
ROTATE   │
         ├──► example$
         ├──► $example
         ├──► e$exampl
         ├──► le$examp
         ├──► ple$exam
         ├──► mple$exa
         ├──► ample$ex
         ├──► xample$e
SORT     │
         ▼
         $example
         ample$ex
         e$exampl    (Conceptual)
         example$     BWT Matrix
         le$examp
         mple$exa
         ple$exam
         xample$e
SLICE    │
         ▼      ┌─┐
         $exampl│e│
         ample$e│x│
         e$examp│l│
         example│$│
         le$exam│p│
         mple$ex│a│
         ple$exa│m│
         xample$│e│
                └┬┘
         ┌───────┘
         ▼
✨BWT✨  exl$pame
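The append–rotate–sort–slice steps above can be sketched naively in a few lines of Python. This quadratic version is for illustration only:

```python
def bwt(text, sentinel="$"):
    """Naive BWT: build all cyclic rotations, sort them, and take the
    last column. O(n^2 log n); fine for small examples, not large inputs."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

print(bwt("example"))  # exl$pame
```

Note that `"$"` works as the sentinel here because it sorts below every lowercase letter in ASCII.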

Though it’s not immediately obvious how, the BWT can be reversed to retrieve the original input string (in more ways than one).

Start by sorting the BWT, keeping track of each symbol’s original index.

SORT

 e x l $ p a m e
─────────────────
 0 1 2 3 4 5 6 7
 │ │ │ │ │ │ │ └────┐
 │ │ │ │ │ │ └──────┼───────────┐
 │ │ │ │ └─┼────────┼───────────┼─────┐
 │ │ └─┼───┼────────┼─────┐     │     │
 │ └───┼───┼────────┼─────┼─────┼─────┼─────┐
 └─────┼───┼──┐     │     │     │     │     │
  ┌────┘┌──┘  │     │     │     │     │     │
  ▼     ▼     ▼     ▼     ▼     ▼     ▼     ▼
($,3) (a,5) (e,0) (e,7) (l,2) (m,6) (p,4) (x,1)
────────────────────────────────────────────────
  0     1     2     3     4     5     6     7

Note that the sorted BWT corresponds to the first column of the BWT Matrix.

The sorted BWT acts as a chain of instructions to get back to the original string.

Traverse the sequence of symbol-index “pairs”. Begin with the first pair. Read off the symbol (the first value from the pair), and then use the index (the second value from the pair) to determine which pair to visit next.

Repeat these steps until arriving back at index 0 (the first symbol should always be the special, lexicographically-lower-than-anything-else symbol, so there is no need to read it off).

TRAVERSE
              ┌───────────┐
              │           │     ┌─────┐
 A┌───────────┼─────┐    G│    E│     │     C
  │     ┌─────┼─────┼─────┼─────┼─────┼─────┐
  ▲     ▼     ▼     ▼     ▲     ▲     ▼     ▲
($,3) (a,5) (e,0) (e,7) (l,2) (m,6) (p,4) (x,1)
────────────────────────────────────────────────
  0     1     2     3     4     5     6     7
  ▲     ▼     ▼     ▼     ▲     ▲     ▼     ▲
  │     └─────┼─────┼─────┼─────┘     │     │
  │     D     │    B│     │          F│     │
  │           │     └─────┼───────────┼─────┘
  ▲           ▼           │           │
START        END          └───────────┘


A                    e______
A─►B                 ex_____
A─►B─►C              exa____
A─►B─►C─►D           exam___
A─►B─►C─►D─►E        examp__
A─►B─►C─►D─►E─►F     exampl_
A─►B─►C─►D─►E─►F─►G  example ✨original input✨
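The chain-following traversal illustrated above can be sketched as:

```python
def inverse_bwt_pairs(bwt):
    """Reverse the BWT by sorting (symbol, original index) pairs and
    following the chain of indices, starting from the sentinel's pair."""
    pairs = sorted((sym, i) for i, sym in enumerate(bwt))
    out = []
    _, nxt = pairs[0]                # the sentinel's pair; its symbol is skipped
    for _ in range(len(bwt) - 1):
        sym, nxt = pairs[nxt]        # follow the index, read off the symbol
        out.append(sym)
    return "".join(out)

print(inverse_bwt_pairs("exl$pame"))  # example
```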

While the illustrations above may not show it, both construction and reversal of the BWT can be done in linear time. A suffix array may be used for construction.
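To illustrate the suffix-array connection, here’s a sketch; each BWT symbol is simply the symbol that precedes the corresponding sorted suffix. This version sorts suffixes naively, but swapping in a linear-time suffix array algorithm (e.g. SA-IS) makes construction linear overall:

```python
def bwt_via_suffix_array(text, sentinel="$"):
    """BWT from a suffix array: BWT[i] is the symbol preceding suffix sa[i].
    With the sentinel appended, sorting suffixes orders the rotations too."""
    s = text + sentinel
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array
    # When sa[i] == 0, s[-1] wraps around to the sentinel, as intended.
    return "".join(s[i - 1] for i in sa)

print(bwt_via_suffix_array("example"))  # exl$pame
```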

For reversal, the key concept is called “LF-Mapping”: the k-th occurrence of a symbol in the BWT (the last column) corresponds to the k-th occurrence of that symbol in the first column of the matrix; in other words, their ranks are the same. Reversal via the LF-Mapping property can be done as follows:

  1. For each position in the BWT, determine its rank: how many times that same symbol has appeared before it. The total frequency of each symbol can be obtained as a byproduct.
BWT:    e l o $ g o g
──────────────────────
Ranks:  0 0 0 0 0 1 1

Frequencies:
{e: 1, l: 1, o: 2, $: 1, g: 2}
  2. Determine the “cumulative count” (cc) for each symbol, which is a running sum of the frequencies of each symbol (in sorted order).
{e: 1, l: 1, o: 2, $: 1, g: 2}

│  SORT  │
▼        ▼

{$: 1, e: 1, g: 2, l: 1, o: 2}
    │     │     └────┐│
    │     └─────────┐││
    └─────────┐     │││
cc['$'] = 0   │     │││
          │   │     │││
          ▼   ▼     │││
cc['e'] = 0 + 1 = 1 │││
          ┌───────┘ │││
          │   ┌─────┘││
          ▼   ▼      ││
cc['g'] = 1 + 1 = 2  ││
          ┌───────┘  ││
          │   ┌──────┘│
          ▼   ▼       │
cc['l'] = 2 + 2 = 4   │
          ┌───────┘   │
          │   ┌───────┘
          ▼   ▼
cc['o'] = 4 + 1 = 5
  3. Starting at the sentinel’s position, walk the BWT: jump to the next position with next_index = cc[symbol] + rank[index], then collect the symbol at that position. This works because the k-th occurrence of a symbol in the last column corresponds to the k-th occurrence in the first column; their ranks are the same.
e l o $ g o g
      ▲
      ├─ cc['$'] = 0
      └─ rank[3] = 0
      ──────────────
         sum     = 0 (next index)
         result  = e _ _ _ _ _

e l o $ g o g
▲
├─ cc['e'] = 1
└─ rank[0] = 0
──────────────
   sum     = 1 (next index)
   result  = e l _ _ _ _

e l o $ g o g
  ▲
  ├─ cc['l'] = 4
  └─ rank[1] = 0
  ──────────────
     sum     = 4 (next index)
     result  = e l g _ _ _

e l o $ g o g
        ▲
        ├─ cc['g'] = 2
        └─ rank[4] = 0
        ──────────────
           sum     = 2 (next index)
           result  = e l g o _ _

e l o $ g o g
    ▲
    ├─ cc['o'] = 5
    └─ rank[2] = 0
    ──────────────
       sum     = 5 (next index)
       result  = e l g o o _

e l o $ g o g
          ▲
          ├─ cc['o'] = 5
          └─ rank[5] = 1
          ──────────────
             sum     = 6 (next index)
             result  = e l g o o g
  4. Reverse the collected symbols to get the original text.
elgoog

│ REVERSE │
▼         ▼

google ✨original input✨
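The four steps above can be sketched as a minimal, unoptimized Python function:

```python
from collections import Counter

def inverse_bwt_lf(bwt, sentinel="$"):
    """Reverse the BWT via the LF-Mapping: next_index = cc[symbol] + rank[index]."""
    # Step 1: rank of each position (occurrences of that symbol so far),
    # with the total symbol frequencies as a byproduct.
    seen = Counter()
    ranks = []
    for sym in bwt:
        ranks.append(seen[sym])
        seen[sym] += 1
    # Step 2: cumulative counts in sorted symbol order.
    cc, running = {}, 0
    for sym in sorted(seen):
        cc[sym] = running
        running += seen[sym]
    # Step 3: walk from the sentinel's position, collecting symbols.
    out = []
    idx = bwt.index(sentinel)
    for _ in range(len(bwt) - 1):
        idx = cc[bwt[idx]] + ranks[idx]
        out.append(bwt[idx])
    # Step 4: the walk produces the text backwards, so reverse it.
    return "".join(reversed(out))

print(inverse_bwt_lf("elo$gog"))  # google
```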

Why does it help improve compression?

It’s difficult to understand why the BWT is useful from the small example shown above. The neat property about it is that the transformed symbols end up being very predictable and repetitive. High predictability means low entropy, which makes for effective compression.

Each symbol in the BWT is sorted by its right-side context. Consider common words like “the”, “there”, “their”, “they”, etc. Each “t” has a similar suffix. The BWT tends to bring these details to the surface. For example:

Input:   "the", "there", "their", "they"
BWT:     ",,,yeer$   """rhhhhtttteie""""e

Symbols like “t” and “h” tend to get grouped into runs.

Recall that the BWT is the last column of the (conceptual) BWT matrix. Moving it to the front, however, helps to see each symbol’s corresponding suffix:

...
,░"their",░"...│"│
,░"there",░"...│"│
,░"they"$"th...│"│
e",░"their",...│r│
e",░"there",...│h│
eir",░"they"...│h│
ere",░"their...│h│
ey"$"the",░"...│h│
he",░"there"...│t│
heir",░"they...│t│
here",░"thei...│t│
hey"$"the",░...│t│
ir",░"they"$...│e│
...

  │
  ▼
...
│"│,░"their",░"...
│"│,░"there",░"...
│"│,░"they"$"th...
│r│e",░"their",...
│h│e",░"there",...
│h│eir",░"they"...
│h│ere",░"their...
│h│ey"$"the",░"...
│t│he",░"there"...
│t│heir",░"they...
│t│here",░"thei...
│t│hey"$"the",░...
│e│ir",░"they"$...
...

This tool can be used to play around with the BWT to help visualize what it does.