

Compression Fundamentals

Why text compresses but random bits do not. Entropy gives the floor, Huffman codes approach it, and LZ77 supplies the dictionary core of gzip.

Why This Matters

A 1 MiB English text file often gzip-compresses to about 250 KiB, near 2 bits per character if the original uses 8-bit bytes. A 1 MiB file filled with uniform random bytes usually stays near 1 MiB and may grow by a header plus block overhead.

The difference is not implementation luck. English has biased symbols and repeated substrings. Random bytes have neither. Shannon entropy gives the average bits-per-symbol floor for a source model, Huffman coding constructs prefix codes near that floor, and LZ77 turns repeated byte strings into length-distance references.

Core Definitions

Definition

Lossless Compression

A lossless compressor maps an input byte string to a shorter byte string plus enough structure for exact reconstruction. If the decompressor returns even one bit different from the input, the scheme is not lossless for that input.

Definition

Lossy Compression

A lossy compressor returns an approximation chosen to preserve a task-specific notion of quality. JPEG may change pixel values. MP3 may remove masked audio components. Neural-network weight quantization may change floating-point weights while preserving validation accuracy within a tolerance.

Definition

Entropy of a Discrete Source

For a discrete random variable X with probabilities p(x), the Shannon entropy is H(X) = -\sum_x p(x)\log_2 p(x). It is measured in bits per symbol when the logarithm has base 2.

Definition

Prefix Code

A prefix code assigns each symbol a bit string such that no codeword is a prefix of another. This lets a decoder read bits left to right without a separator between symbols.

Entropy as the Bits-Per-Symbol Floor

For a source with symbols x and probabilities p(x), entropy is

H(X) = -\sum_x p(x)\log_2 p(x).

A fair bit has entropy 1 bit because both outcomes have probability 1/2. A byte uniformly sampled from 256 values has entropy 8 bits. No lossless compressor can reduce an independent stream of such bytes on average under the true distribution.

Biased symbols are different. Suppose a source emits A with probability 1/2, B with probability 1/4, C with probability 1/8, and D with probability 1/8. Its entropy is

H(X) = -(0.5\log_2 0.5 + 0.25\log_2 0.25 + 0.125\log_2 0.125 + 0.125\log_2 0.125).

H(X) = 0.5 + 0.5 + 0.375 + 0.375 = 1.75.

A fixed-width encoding for four symbols uses 2 bits per symbol. A variable-length code can do better by assigning shorter codewords to frequent symbols.

Here is a minimal C calculation for entropy from counts.

#include <math.h>
#include <stdio.h>

int main(void) {
    /* observed symbol counts for a 100-symbol sample */
    int count[] = {50, 25, 12, 13};
    int n = 100;
    double H = 0.0;

    for (int i = 0; i < 4; i++) {
        double p = (double)count[i] / n;  /* empirical probability */
        H -= p * log2(p);                 /* accumulate -p * log2(p) */
    }

    printf("%.3f bits per symbol\n", H);
    return 0;
}

This prints about 1.750 bits per symbol.

For English text, the true number depends on the source model. Character frequencies alone give a few bits per character. Models using context, words, and long-range constraints reduce the estimated entropy. Shannon's classic experiments placed English around 1 bit per character for skilled prediction; a practical rule for modern gzip on ordinary English prose is about 2 bits per byte-sized character, with entropy estimates often quoted around 1.3 bits per character for strong language models.

No Compressor Wins on Every File

Lossless decompression must be one-to-one on the set of compressed representations it accepts. If two different inputs map to the same compressed bytes, the decompressor cannot know which original to return.

For n-bit files, there are 2^n possible inputs. The set of bit strings shorter than n bits has size

1 + 2 + 4 + \cdots + 2^{n-1} = 2^n - 1.

That is one slot too few. So no injective lossless compressor can map every n-bit input to a shorter output. If it shortens some files, it must leave some unchanged or make some longer.

A practical compressor often adds a magic number, method byte, checksum, and length fields. For a short input like hi, gzip metadata dominates the two data bytes. This is not a defect in the entropy theorem; it is fixed overhead plus the fact that tiny samples do not expose much redundancy.

Huffman Coding by Hand

Huffman coding constructs an optimal prefix code for a known set of symbol frequencies when each symbol is encoded separately. It repeatedly merges the two lowest-frequency nodes into a binary tree. Left and right edges become 0 and 1.

Take the string BANANA_BANDANA. Its 14 characters have these counts.

Symbol   Count   Probability
A        6       0.4286
N        4       0.2857
B        2       0.1429
_        1       0.0714
D        1       0.0714

The entropy is about 1.985 bits per character. A fixed-width code for five symbols needs 3 bits per character, so the 14-character string takes 42 bits before storing any table.

Huffman merges the two count-1 symbols, then merges count-2 B with that node, then merges count-4 N with that node, then merges count-6 A with the rest. One valid code is

Symbol   Code    Length
A        0       1
N        10      2
B        110     3
_        1110    4
D        1111    4

The encoded payload is

B   A N  A N  A _    B   A N  D    A N  A
110 0 10 0 10 0 1110 110 0 10 1111 0 10 0

Concatenated, that is 28 bits.

1100100100111011001011110100

Packed into bytes with four zero pad bits at the end, the payload bytes are

11001001 00111011 00101111 01000000
0xC9     0x3B     0x2F     0x40
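
A few lines of C can reproduce this packing step. This is a sketch only, assuming bits are packed most-significant first and the final byte is padded with zeros, as in the layout above.

#include <stdio.h>
#include <string.h>

int main(void) {
    /* the 28-bit Huffman payload from the worked example */
    const char *bits = "1100100100111011001011110100";
    size_t n = strlen(bits);
    unsigned char byte = 0;
    int filled = 0;

    for (size_t i = 0; i < n; i++) {
        byte = (unsigned char)((byte << 1) | (bits[i] - '0'));
        if (++filled == 8) {          /* a full byte is ready */
            printf("0x%02X ", byte);
            byte = 0;
            filled = 0;
        }
    }
    if (filled) {                     /* pad the final partial byte with zeros */
        byte = (unsigned char)(byte << (8 - filled));
        printf("0x%02X", byte);
    }
    printf("\n");                     /* prints 0xC9 0x3B 0x2F 0x40 */
    return 0;
}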

The average code length is 28/14 = 2.0 bits per character, very close to the 1.985-bit entropy for this empirical distribution. Real file formats must also store or reconstruct the codebook. For small inputs, that overhead can erase the gain.
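
The merge procedure is simple enough to sketch directly. The toy C program below tracks only code lengths, repeatedly merging the two smallest-weight groups and adding one bit to every symbol in each merged group; it assumes the BANANA_BANDANA counts and is not a general Huffman builder.

#include <stdio.h>

int main(void) {
    /* symbols and counts from BANANA_BANDANA */
    const char sym[5] = {'A', 'N', 'B', '_', 'D'};
    int weight[5] = {6, 4, 2, 1, 1};  /* weight of each group's root */
    int group[5]  = {0, 1, 2, 3, 4};  /* which group each symbol belongs to */
    int len[5]    = {0, 0, 0, 0, 0};  /* code length per symbol */
    int alive[5]  = {1, 1, 1, 1, 1};  /* group not yet merged away */

    for (int merges = 0; merges < 4; merges++) {
        int a = -1, b = -1;           /* the two live groups with smallest weights */
        for (int g = 0; g < 5; g++) {
            if (!alive[g]) continue;
            if (a < 0 || weight[g] < weight[a]) { b = a; a = g; }
            else if (b < 0 || weight[g] < weight[b]) { b = g; }
        }
        for (int s = 0; s < 5; s++)   /* one more bit for both merged groups */
            if (group[s] == a || group[s] == b) len[s]++;
        for (int s = 0; s < 5; s++)   /* fold group b into group a */
            if (group[s] == b) group[s] = a;
        weight[a] += weight[b];
        alive[b] = 0;
    }

    for (int s = 0; s < 5; s++)
        printf("%c: code length %d\n", sym[s], len[s]);
    return 0;
}

It prints lengths 1, 2, 3, 4, 4, matching the table above.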

Arithmetic Coding and the Epsilon Gap

Huffman coding assigns an integer number of bits to each symbol. If a symbol has probability 0.9, the ideal length is -\log_2 0.9 \approx 0.152 bits, but a per-symbol prefix code cannot spend a fractional bit on one occurrence.

Arithmetic coding encodes a whole message as a subinterval of [0, 1). Each symbol narrows the interval according to its probability. A final binary fraction inside the interval identifies the entire message.

If a message x_1, \ldots, x_n has probability

P(x_1, \ldots, x_n) = \prod_{i=1}^n p(x_i),

then its ideal code length is

-\log_2 P(x_1, \ldots, x_n) = \sum_{i=1}^n -\log_2 p(x_i).

With careful renormalization and finite-precision arithmetic, arithmetic coding can get within \epsilon bits per symbol of the entropy rate for long messages. The price is more complex encoder and decoder state than a Huffman tree, plus historical patent and implementation concerns. Many modern formats use range coding or tabled variants for similar reasons.
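
The interval-narrowing step can be sketched with plain doubles and no renormalization, which only works for short messages. The distribution below is the A, B, C, D source from earlier, and the message ABAD is an arbitrary example chosen for illustration.

#include <math.h>
#include <stdio.h>

int main(void) {
    /* cumulative boundaries for probabilities 1/2, 1/4, 1/8, 1/8 */
    double cum[5] = {0.0, 0.5, 0.75, 0.875, 1.0};
    const char *msg = "ABAD";

    double lo = 0.0, hi = 1.0;
    for (const char *c = msg; *c; c++) {
        int s = *c - 'A';
        double width = hi - lo;
        hi = lo + width * cum[s + 1];  /* shrink to the symbol's slice */
        lo = lo + width * cum[s];
    }

    /* any binary fraction inside [lo, hi) identifies the whole message */
    printf("interval [%.10f, %.10f)\n", lo, hi);
    printf("ideal length %.3f bits\n", -log2(hi - lo));
    return 0;
}

The final interval has width 1/128, so a point inside it can be named with 7 bits, matching the sum of the ideal per-symbol lengths 1 + 2 + 1 + 3.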

LZ77 and the Sliding Window

Huffman and arithmetic coding exploit symbol bias. LZ77 exploits repeated substrings. It keeps a sliding window of previous bytes. When the next input bytes match earlier bytes, the encoder emits a pair containing a distance backward and a match length. Otherwise it emits a literal byte.

For abcabcabcX, the ASCII bytes are

61 62 63 61 62 63 61 62 63 58
 a  b  c  a  b  c  a  b  c  X

After outputting literals a, b, and c, the encoder finds that the next six bytes abcabc match a copy starting 3 bytes back; the match length may exceed the distance because the copy reuses bytes it has just produced. An illustrative LZSS token stream is

literal 0x61
literal 0x62
literal 0x63
match distance 3 length 6
literal 0x58

One toy byte layout could use a flag byte followed by payload bytes, with flag bit 0 for a literal and 1 for a match.

flags   payload
0x08    61 62 63 03 06 58

This toy encoding uses 7 bytes instead of 10, before any file header. DEFLATE, the format inside gzip, does more. It emits literals, length codes, and distance codes, then Huffman-codes those token streams. So gzip combines dictionary substitution from LZ77 with entropy coding over the resulting symbols.
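
To make the copy step concrete, here is a sketch of a decoder for the toy layout above. It assumes flag bits are consumed least-significant bit first and that a match is stored as one distance byte followed by one length byte; nothing here corresponds to a real on-disk format.

#include <stdio.h>

int main(void) {
    unsigned char flags = 0x08;                           /* 0 = literal, 1 = match */
    unsigned char payload[] = {0x61, 0x62, 0x63, 0x03, 0x06, 0x58};
    unsigned char out[64];
    size_t in_pos = 0, out_pos = 0;

    for (int token = 0; token < 5; token++) {
        if (flags & (1u << token)) {                      /* match token */
            unsigned dist = payload[in_pos++];
            unsigned len  = payload[in_pos++];
            for (unsigned i = 0; i < len; i++) {
                out[out_pos] = out[out_pos - dist];       /* overlapping copy is fine */
                out_pos++;
            }
        } else {                                          /* literal token */
            out[out_pos++] = payload[in_pos++];
        }
    }

    out[out_pos] = '\0';
    printf("%s\n", out);                                  /* prints abcabcabcX */
    return 0;
}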

LZSS is a common variant that omits matches unless they save space. A match of length 2 may cost more than two literal bytes after flags and distance fields. Real encoders spend CPU time finding good matches inside a bounded window, commonly 32 KiB for DEFLATE.

When Lossy Beats Lossless

Lossless compression preserves every bit, so it cannot discard measurement noise, perceptually irrelevant detail, or numerical precision that a downstream task does not need. Lossy coding changes the object being represented.

JPEG converts image blocks into frequency coefficients and quantizes them, often removing high-frequency detail that viewers tolerate. MP3 models auditory masking and allocates fewer bits to components that are hard to hear. Neural-network weight quantization stores weights in 8-bit integers, 4-bit integers, or mixed formats instead of 32-bit floats. That is not lossless, but inference accuracy can remain acceptable if the quantization error is controlled.
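
As a sketch of the weight-quantization idea, the fragment below maps a handful of made-up float weights to 8-bit integers with one symmetric scale and prints the round-trip error. Real schemes add per-channel scales, zero points, and calibration data.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* made-up weights for illustration only */
    float w[6] = {0.31f, -0.07f, 1.42f, -0.88f, 0.002f, -1.10f};
    int n = 6;

    float max_abs = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(w[i]) > max_abs) max_abs = fabsf(w[i]);
    float scale = max_abs / 127.0f;       /* one symmetric scale for all weights */

    for (int i = 0; i < n; i++) {
        long r = lrintf(w[i] / scale);    /* lossy rounding step */
        if (r > 127) r = 127;
        if (r < -127) r = -127;
        int8_t q = (int8_t)r;
        float back = q * scale;           /* dequantized value */
        printf("%+.4f -> %4d -> %+.4f (err %+.4f)\n", w[i], q, back, back - w[i]);
    }
    return 0;
}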

A lossless compressor applied after lossy coding is still useful. JPEG and MP3 both include entropy coding stages after quantization. For ML models, packed low-bit weights still need byte layouts, alignment, and sometimes entropy coding if distribution skew is high.

Key Result

Theorem

Source Coding Limit for Lossless Compression

Statement

For a discrete memoryless source with entropy H(X), no lossless code can have expected length below H(X) bits per symbol in the limit without losing injectivity. Prefix codes such as Huffman satisfy H(X) \leq L < H(X) + 1 for single-symbol coding. Block and arithmetic codes can make the overhead per symbol smaller than any chosen \epsilon > 0 for long enough blocks.

Intuition

A source with entropy H has about 2^{nH} typical length-n messages. Distinguishing that many likely messages needs about nH bits. Assigning shorter names to common messages works only because uncommon messages receive longer names.

Proof Sketch

The lower bound follows from Kraft's inequality for prefix codes and Gibbs' inequality. If codeword lengths are \ell(x), Kraft's inequality requires \sum_x 2^{-\ell(x)} \leq 1. Minimizing expected length \sum_x p(x)\ell(x) under this constraint gives ideal lengths near -\log_2 p(x), whose expectation is H(X). Huffman's merge rule constructs the optimal integer-length prefix tree for known single-symbol frequencies. Arithmetic coding avoids per-symbol integer rounding by coding long sequences as intervals.
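
The Kraft sum is easy to check numerically. For the BANANA_BANDANA code lengths from the Huffman section, a full binary code tree meets the inequality with equality:

#include <math.h>
#include <stdio.h>

int main(void) {
    int len[5] = {1, 2, 3, 4, 4};        /* lengths for A, N, B, _, D */
    double kraft = 0.0;
    for (int i = 0; i < 5; i++)
        kraft += pow(2.0, -len[i]);      /* sum of 2^(-length) */
    printf("Kraft sum = %.6f\n", kraft); /* prints 1.000000 */
    return 0;
}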

Why It Matters

The theorem separates modeling from coding. Better compression usually means a better probability model or a better dictionary parse, not a magic backend that defeats entropy.

Failure Mode

The theorem is distributional. It does not say every file shrinks. The pigeonhole count proves the opposite for fixed-length input sets.

Common Confusions

Watch Out

Compression ratio is not a property of the algorithm alone

A gzip ratio measured on English text says little about the same program on encrypted data, JPEG files, or random bytes. Already-compressed formats often have near-uniform byte distributions and few repeated substrings, so gzip has little to exploit.

Watch Out

Entropy of bytes is not always entropy of the data source

If UTF-8 English text is modeled as raw bytes, common ASCII letters dominate. If it is modeled as words with context, the estimated entropy changes. Entropy is computed under a model; a bad model gives a weak bound.

Watch Out

Huffman codes are not self-describing

The decoder must know the code tree or a canonical representation of code lengths. The 28-bit BANANA_BANDANA payload is not decodable by itself unless the codebook is known.

Exercises

Exercise (Core)

Problem

Build a Huffman code for counts A: 8, B: 4, C: 2, D: 2. Compute the entropy and the average Huffman length.

Exercise (Core)

Problem

Decode the LZSS-style token stream literal 0x61, literal 0x62, literal 0x63, match distance 3 length 6, literal 0x58.

Exercise (Advanced)

Problem

For 8-bit files, prove by counting that at least one 8-bit input must fail to compress if a lossless compressor maps every input to a bit string of length at most 8.

References

Canonical:

  • Charles Petzold, Code: The Hidden Language of Computer Hardware and Software, 2nd ed. (2022), ch. 7-9, binary codes, bytes, and structured representations
  • Andrew S. Tanenbaum and Todd Austin, Structured Computer Organization, 6th ed. (2013), ch. 2, data representation at the machine level
  • Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, 2nd ed. (2006), ch. 5, entropy and lossless source coding
  • David Salomon and Giovanni Motta, Handbook of Data Compression, 5th ed. (2010), ch. 2-3, statistical coding and dictionary methods
  • Khalid Sayood, Introduction to Data Compression, 5th ed. (2017), ch. 3-5, Huffman coding, arithmetic coding, and dictionary coding
  • L. Peter Deutsch, DEFLATE Compressed Data Format Specification version 1.3, RFC 1951 (1996), §§3.2.2-3.2.7, LZ77 tokens and Huffman coding in DEFLATE

Accessible:

  • David MacKay, Information Theory, Inference, and Learning Algorithms, ch. 4-6, open textbook treatment of entropy and coding
  • Mark Nelson, The Data Compression Book, 2nd ed., practical explanations of Huffman, LZ77, and related file formats
  • Matt Mahoney, Data Compression Explained, online notes covering entropy, dictionary coding, and practical compressors
