Bits and Codes · 40 min

Error Detection and Correction

Parity bits catch single flips, Hamming codes correct them, and CRCs catch burst errors in networks and storage. The Singleton and Hamming bounds say what is achievable.

Why This Matters

A 64-bit server memory word often travels with 8 extra check bits, making a 72-bit physical word. That 12.5 percent overhead lets ECC RAM correct one flipped bit in the word and detect two. Without it, a bit flipped by a charged particle, thermal noise, RF interference, or DRAM cell wear can turn a tensor value, a page-table entry, or a model checkpoint byte into a different value with no software exception.

Networks and disks face longer error patterns. Ethernet appends a 32-bit CRC to each frame. ZIP stores a CRC-32 per file entry. TCP, by contrast, uses a 16-bit one's-complement checksum rather than a CRC. The design question is quantitative: how many extra bits buy detection, correction, or both?

Core Definitions

Definition

Block code

A block code maps k information symbols to an n-symbol codeword. For a binary code, the rate is R = k/n, the redundancy is n - k, and the code contains at most 2^k valid codewords inside the 2^n possible bit strings.

Definition

Hamming distance

The Hamming distance d(x,y) between two equal-length strings is the count of positions where they differ. The minimum distance of a code C is d_{\min} = \min_{x \neq y \in C} d(x,y).

Definition

Parity bit

A parity bit is one check bit chosen so the total number of 1 bits is even or odd. Even parity detects every odd number of flips in the protected group and corrects no bit by itself.

Definition

Cyclic redundancy check

A CRC treats a bit string as a polynomial over \mathbb{F}_2, divides by a fixed generator polynomial, and appends the remainder. Receiver and sender agree on the generator, bit order, initial value, and final xor.

Bits Flip for Physical Reasons

A stored 1 and a stored 0 are voltage, charge, magnetization, or phase states. Noise moves the measured value. Cosmic rays and alpha particles deposit charge in silicon. Thermal noise perturbs small signals. RF interference couples into wires. Flash cells wear out because erase cycles damage the oxide; reads and retention time also move thresholds.

The effect of a single bit flip depends on where it lands. Suppose the byte 0x41 stores ASCII 'A'.

0x41 = 0100 0001
flip bit 5 from 0 to 1
0x61 = 0110 0001 = ASCII 'a'

For a 32-bit IEEE 754 float, flipping an exponent bit can have a much larger effect.

1.0f = 0x3f800000
bits = 00111111 10000000 00000000 00000000

flip exponent bit 30
0x7f800000 = +infinity
bits = 01111111 10000000 00000000 00000000

This is why storage and transmission formats carry check information. They do not make hardware immune to errors; they make many errors visible and some errors correctable.

Parity Detects Odd Counts, Not Locations

For even parity, append p = b_0 \oplus b_1 \oplus \cdots \oplus b_{m-1}. The receiver recomputes the xor over data and parity. A nonzero result means an odd number of protected bits changed.

Worked byte example:

data byte:       10110010
number of 1s:   4
even parity p:  0
stored bits:    10110010 0

one flip:        10110000 0
number of 1s:   3
parity check:   fail

two flips:       10100000 0
number of 1s:   2
parity check:   pass

Parity gives a one-bit syndrome, so it cannot name which of the 9 stored positions flipped. It catches every 1-, 3-, 5-, 7-, or 9-bit error in this group. It misses every 2-, 4-, 6-, or 8-bit error, because an even number of flips leaves the total parity unchanged.

The same xor appears directly in code because addition in \mathbb{F}_2 is xor. A compact even-parity check for one byte is:

#include <stdint.h>

/* Returns nonzero when data plus its stored parity bit have even total parity. */
int even_parity_ok(uint8_t data, unsigned parity_bit) {
    uint8_t x = data;
    x ^= x >> 4;   /* fold the byte's parity into the low nibble */
    x ^= x >> 2;   /* then into the low two bits */
    x ^= x >> 1;   /* bit 0 now holds the xor of all eight data bits */
    return ((x ^ parity_bit) & 1u) == 0u;
}

Hamming(7,4) Encoding and Decoding

Hamming(7,4) uses positions 1, 2, and 4 for parity, and positions 3, 5, 6, and 7 for data. With even parity:

p_1 covers positions 1, 3, 5, 7

p_2 covers positions 2, 3, 6, 7

p_4 covers positions 4, 5, 6, 7

Encode the 4-bit message 1011 as d_1 = 1, d_2 = 0, d_3 = 1, d_4 = 1 placed at positions 3, 5, 6, 7.

position:  1 2 3 4 5 6 7
role:     p1 p2 d1 p4 d2 d3 d4
data:           1     0  1  1

p1 = pos3 xor pos5 xor pos7 = 1 xor 0 xor 1 = 0
p2 = pos3 xor pos6 xor pos7 = 1 xor 1 xor 1 = 1
p4 = pos5 xor pos6 xor pos7 = 0 xor 1 xor 1 = 0

codeword: 0 1 1 0 0 1 1 = 0110011

Now flip position 6 during storage or transmission.

sent:     0 1 1 0 0 1 1
received: 0 1 1 0 0 0 1
position: 1 2 3 4 5 6 7

The syndrome bits recompute the parity checks:

s1 = r1 xor r3 xor r5 xor r7 = 0 xor 1 xor 0 xor 1 = 0
s2 = r2 xor r3 xor r6 xor r7 = 1 xor 1 xor 0 xor 1 = 1
s4 = r4 xor r5 xor r6 xor r7 = 0 xor 0 xor 0 xor 1 = 1

syndrome as binary s4 s2 s1 = 110 = 6

The syndrome names the flipped position. Flip received position 6 back to 1, yielding 0110011, then read data positions 3, 5, 6, 7 as 1011.

Plain Hamming(7,4) has minimum distance 3. It corrects one error. It detects two errors if the receiver only asks whether the syndrome is nonzero, but a correcting decoder will miscorrect a two-bit error to some third position. SECDED in memory is usually the extended Hamming code: Hamming(7,4) plus one overall parity bit, making an (8,4) code with minimum distance 4.

Extended decoding uses this table:

syndrome = 0, overall parity ok:      no error
syndrome = 0, overall parity bad:     overall parity bit flipped
syndrome != 0, overall parity bad:    correct bit named by syndrome
syndrome != 0, overall parity ok:     double error detected, do not correct

Example double error in positions 6 and 7 of 0110011 gives syndrome 001, but the overall parity remains even. The extended decoder reports a double error instead of flipping position 1.

CRCs Use Polynomial Division Mod 2

A CRC replaces integer division with polynomial division over \mathbb{F}_2. Addition and subtraction are both xor. For generator g(x) = x^3 + x + 1, the binary divisor is 1011. For message bits 110100, append three zeros, divide by 1011, and append the 3-bit remainder.

message:          110100
append 3 zeros:   110100000
generator:        1011
remainder:        100
transmitted:      110100100

A tiny bit-oriented implementation shows the operation. It is slow, but it matches the arithmetic.

#include <stdint.h>

/* Remainder of (bits * x^3) divided by g(x) = x^3 + x + 1, i.e. 1011. */
uint32_t crc3_1011(uint32_t bits, int len) {
    uint32_t reg = bits << 3;      /* append three zero bits */
    uint32_t poly = 0b1011u;

    /* cancel each leading 1 from the top bit down to degree 3 */
    for (int i = len + 2; i >= 3; --i) {
        if ((reg >> i) & 1u) {
            reg ^= poly << (i - 3);
        }
    }
    return reg & 0b111u;           /* the 3-bit remainder */
}

If the receiver divides 110100100 by 1011, the remainder is zero. If a burst flips the contiguous pattern 001110000, the received word is 111010100; its remainder is nonzero for this generator, so the error is detected.

Real CRCs use wider polynomials and fixed conventions. Ethernet carries a 32-bit CRC in the frame check sequence field, ZIP records a CRC-32 per entry, and many storage protocols use CRC variants. TCP is different: its 16-bit checksum catches many random corruptions but is not polynomial CRC division.

ECC RAM and Reed-Solomon Storage Codes

ECC DRAM commonly protects each 64-bit data word with 8 check bits. That is 72 physical bits per 64 data bits:

\text{overhead} = \frac{8}{64} = 0.125 = 12.5\%

On a corrected single-bit event, the memory controller can return the corrected data and log the address. On an uncorrectable multi-bit event, the machine can raise a machine-check exception or poison the cache line, depending on platform behavior. Servers care because a long-running job touches terabytes of memory over hours or days, and silent corruption is worse than a visible crash for databases, training runs, and checkpointed simulations.

Reed-Solomon codes work over larger symbols, often bytes, rather than individual bits. An RS(n,k) code has minimum distance d = n - k + 1, so it can correct \lfloor(n-k)/2\rfloor unknown symbol errors or recover up to n - k known erasures. This symbol view is why Reed-Solomon fits CDs, DVDs, QR codes, and RAID-like storage. A scratch may corrupt many adjacent bits, but after interleaving it occupies a smaller number of byte symbols. RAID-6 uses two parity blocks per stripe and can survive two disk erasures because the failed disk locations are known.

A numeric storage sketch:

RS(10,8) over bytes
data symbols:      8
check symbols:     2
minimum distance:  d = 10 - 8 + 1 = 3
unknown errors:    floor((3 - 1) / 2) = 1
known erasures:    d - 1 = 2

Full Reed-Solomon decoding uses finite-field algebra beyond this page. The design lesson is still simple: coding by symbols converts long bit bursts into a smaller count of symbol errors or erasures.

Key Result

Theorem

Distance, Hamming Bound, and Singleton Bound

Statement

A code with minimum distance d detects up to d - 1 symbol errors and corrects up to t = \lfloor(d-1)/2\rfloor symbol errors. Any such code with q^k codewords over a q-ary alphabet satisfies the Hamming sphere-packing bound

q^k \sum_{i=0}^{t} {n \choose i}(q-1)^i \leq q^n

and the Singleton bound

k \leq n - d + 1.

Intuition

Detection fails only when an error moves one valid codeword exactly onto another valid codeword. Correction fails when one received word is equally near, or nearer, to two different codewords. Balls of radius t around codewords must not overlap.

Proof Sketch

If fewer than d positions change, a codeword cannot become a different codeword, so detection works through d - 1 errors. If 2t < d, the radius-t balls around distinct codewords are disjoint, so nearest-neighbor decoding is unique. Counting all words in these disjoint balls gives the Hamming bound. For Singleton, delete d - 1 coordinates from every codeword. Distinct codewords must remain distinct, or two original codewords would have differed in at most d - 1 positions. Thus q^k \leq q^{n-d+1}.

Why It Matters

The bounds price redundancy. Hamming(7,4) has n = 7, k = 4, d = 3. The Singleton bound gives 4 \leq 5, so it is not maximum-distance separable. The binary Hamming bound with t = 1 gives 2^4(1 + 7) = 128 = 2^7, so the single-error correction spheres exactly fill the 7-bit space.

Failure Mode

A distance-3 code cannot be SECDED under a correcting decoder. A two-bit error may produce the same syndrome as a one-bit error. SECDED needs distance 4, which extended Hamming provides by adding the overall parity bit.

Common Confusions

Watch Out

Hamming(7,4) versus SECDED

Hamming(7,4) corrects one error and has d = 3. It can notice that some double errors occurred if you only check for a nonzero syndrome, but a normal correcting decoder will map a double error to a wrong single-bit correction. SECDED uses the extended (8,4) code with an overall parity bit, giving d = 4.

Watch Out

CRC versus checksum

Ethernet and ZIP use CRC-32 variants. TCP uses a 16-bit one's-complement checksum over a pseudo-header, TCP header, and payload. Calling TCP's checksum a CRC is incorrect and leads to wrong assumptions about burst-error detection.

Watch Out

Detection is not correction

A parity bit may tell you that a byte plus parity is wrong, but it gives no location. To correct a single-bit error among 7 positions, Hamming(7,4) needs 3 syndrome bits because 2^3 outcomes cover no-error plus 7 possible error locations.

Exercises

ExerciseCore

Problem

Encode the 4-bit message 0110 using even-parity Hamming(7,4) with parity positions 1, 2, and 4. Then suppose position 5 flips. Compute the received word, syndrome, corrected word, and decoded message.

ExerciseCore

Problem

For a binary code with n = 15 and k = 11, use the Hamming bound to test whether single-error correction with t = 1 is possible without violating the counting bound. Then compute the Singleton upper bound on d.

ExerciseAdvanced

Problem

Using generator 1011, compute the CRC remainder for message 101100. Give the transmitted bit string.

References

Canonical:

  • Charles Petzold, Code: The Hidden Language of Computer Hardware and Software (2nd ed., 2022), ch. 7-9, binary codes, telegraphy, and error-checking ideas
  • Andrew S. Tanenbaum and Todd Austin, Structured Computer Organization (6th ed., 2013), ch. 2, data representation, memory organization, and error-correcting codes
  • Shu Lin and Daniel J. Costello Jr., Error Control Coding (2nd ed., 2004), ch. 1-3, block codes, bounds, Hamming codes, and cyclic codes
  • F. J. MacWilliams and N. J. A. Sloane, The Theory of Error-Correcting Codes (1977), ch. 1, Hamming distance, bounds, and introductory code constructions
  • W. Wesley Peterson and E. J. Weldon Jr., Error-Correcting Codes (2nd ed., 1972), ch. 6-7, cyclic codes and BCH/Reed-Solomon background

Accessible:

  • Ross N. Williams, A Painless Guide to CRC Error Detection Algorithms
  • Richard W. Hamming, “Error Detecting and Error Correcting Codes,” Bell System Technical Journal 29(2), 1950
  • Chris Heegard and Stephen B. Wicker, Turbo Coding, ch. 1, historical error-control coding overview

Next Topics

  • /computationpath/finite-fields
  • /computationpath/network-packets-checksums
  • /computationpath/storage-systems-raid
  • /computationpath/channel-coding-overview