Entropy and Coding

Notes from Compact Data Structures, by Gonzalo Navarro

# succinct

2023-10-19


Worst-Case Entropy

The minimum number of bits required to unambiguously identify an object from a set.

⌈log₂(|U|)⌉ bits are needed to uniquely identify every element of a set U.

Example:

|U| = 5, ⌈log₂(5)⌉ = ⌈2.3⌉ = 3

red    --> 000
green  --> 001
blue   --> 010
yellow --> 011
orange --> 100

Two bits can distinguish only four elements, so the fifth (orange) forces a third bit, and every fixed-length code must use 3 bits.
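A minimal Python sketch of the idea, assuming we simply number the elements and pad each code to ⌈log₂(|U|)⌉ bits (the function name is mine):

```python
import math

def fixed_length_codes(universe):
    """Assign every element a distinct code of ceil(log2(|U|)) bits."""
    bits = math.ceil(math.log2(len(universe)))
    return {u: format(i, f"0{bits}b") for i, u in enumerate(universe)}

codes = fixed_length_codes(["red", "green", "blue", "yellow", "orange"])
# Five elements need 3 bits each: red -> 000, green -> 001, ..., orange -> 100
```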

Shannon Entropy

Comes from Claude Shannon’s A Mathematical Theory of Communication (1948).

Takes probabilities into account.

Each u ∈ U occurs with probability Pr(u).

Assigning varying-length codes of ℓ(u) bits reduces the overall average code length, Σ Pr(u) · ℓ(u).

The minimum possible average code length is the Shannon Entropy of the probability distribution Pr : U → [0, 1]:

H(Pr) = Σ_{u ∈ U} Pr(u) · log₂(1/Pr(u))

log₂(1/Pr(u)) bits is the optimal code length for u.

Shannon Entropy coincides with Worst-Case Entropy when all of the probabilities are equal.

The longest code length will always be greater than or equal to the Worst-Case Entropy (with the ceiling function): no code can give all |U| elements fewer than ⌈log₂(|U|)⌉ bits each.
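The two claims above can be checked with a small Python sketch of Shannon entropy (the function name is mine):

```python
import math

def shannon_entropy(probs):
    """H(Pr) = sum of Pr(u) * log2(1/Pr(u)) over all u with Pr(u) > 0."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Uniform over 4 symbols: H = log2(4) = 2 bits, the worst-case
# entropy of a 4-element set (no ceiling needed here).
print(shannon_entropy([0.25] * 4))                 # 2.0

# Skewed distribution: optimal lengths log2(1/p) are 1, 2, 3, 3 bits,
# so the average drops below 2 bits.
print(shannon_entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75
```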

Empirical Entropy

Starts from Binary Entropy, the special case of Shannon Entropy where U = {0,1} (an infinite source emits just bits):

H(p) = p · log₂(1/p) + (1−p) · log₂(1/(1−p)), where p = Pr(1)

Since the emitted bits are independent, the entropy of a sequence of n bits is the sum of the per-bit entropies, n · H(p).

Empirical entropy, in contrast, is the notion of entropy without assuming anything about the source. It’s “observed” from the data itself.
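The binary entropy function can be sketched in Python (the function name is mine; the formula is the Shannon entropy for U = {0,1} with Pr(1) = p):

```python
import math

def binary_entropy(p):
    """Shannon entropy for U = {0, 1} with Pr(1) = p."""
    if p in (0.0, 1.0):
        return 0.0  # a deterministic source carries no information
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# H(p) peaks at p = 0.5 (1 bit per emitted symbol) and falls toward 0
# as the source gets more biased.
```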

Example: B is a concrete bit sequence. The goal is to compress it. The model/source that created the sequence is unknown.

B = 1110101000101101010...   (length n)

Let m be the number of 1s in B, and assume the probability of the source emitting a 1 is p = m / n.

This leads to the concept of Zero-Order Empirical Entropy.

Zero-Order Empirical Entropy

It’s the Shannon Entropy of a source that emits 1s with the probability p = m/n observed in B:

H₀(B) = (m/n) · log₂(n/m) + ((n−m)/n) · log₂(n/(n−m))

It’s a function of the sequence, B, rather than the probabilities.

“Zero-order” because it is considered memoryless, which basically means that only the frequencies of the symbols are taken into account when compressing. There are other aspects of sequences that can be used to achieve better compression, but zero-order entropy does not take those into account.
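H₀(B) can be computed directly from the sequence itself; a small Python sketch (the function name is mine):

```python
import math

def zero_order_entropy(B):
    """H0(B): Shannon entropy of the symbol frequencies observed in B."""
    n = len(B)
    m = B.count("1")                # number of 1s in B
    h = 0.0
    for count in (m, n - m):
        if count > 0:
            p = count / n           # empirical probability of the symbol
            h += p * math.log2(1 / p)
    return h

B = "1110101000101101010"
# n * H0(B) bits is the best any compressor can do if it only looks
# at the frequencies of 0s and 1s in B.
```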