

Bits, Nats, Perplexity, and BPB

The four units people use to measure language model quality, how they relate to each other, when to use each one, and how mixing them up leads to wrong conclusions.


Why This Matters

Language model papers report performance using at least four different units: cross-entropy in nats, cross-entropy in bits, perplexity, and bits-per-byte (BPB). These measure the same underlying quantity (how well the model predicts the next token) but on different scales, with different conventions, and people mix them up constantly.

If you cannot convert between these units fluently, you will miscompare models, misinterpret scaling laws, and misunderstand benchmark results. This page gives you the exact conversions and tells you which unit to use when.

The Four Units

Definition

Cross-Entropy Loss (Nats)

The standard training loss for language models. For a model $q$ predicting tokens from a true distribution $p$ over vocabulary $V$:

$$H_{\text{nats}}(p, q) = -\sum_{v \in V} p(v) \log q(v)$$

where $\log$ is the natural logarithm (base $e$). This is what PyTorch's `CrossEntropyLoss` returns. The unit is nats (natural units of information).

A perfect model achieves $H_{\text{nats}} = H(p)$, the entropy of the true distribution. For natural language, this is typically 2 to 5 nats per token depending on the tokenizer and domain.
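A minimal sketch of the units point in PyTorch (shapes and vocabulary size here are illustrative): `F.cross_entropy` applies `log_softmax` with the natural logarithm, so the value it returns is nats per token.

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 50257                               # GPT-2-sized vocabulary
logits = torch.zeros(4, vocab_size)              # uniform predictions over V
targets = torch.randint(0, vocab_size, (4,))     # next-token ids

# F.cross_entropy applies log_softmax with the natural log, so the
# returned mean loss is in nats per token.
loss_nats = F.cross_entropy(logits, targets)

# Uniform guessing costs exactly ln|V| nats, i.e. perplexity |V|.
print(loss_nats.item(), math.log(vocab_size))    # both ≈ 10.825
```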

Definition

Cross-Entropy Loss (Bits)

The same cross-entropy, but using $\log_2$ instead of $\ln$:

$$H_{\text{bits}} = \frac{H_{\text{nats}}}{\ln 2} \approx \frac{H_{\text{nats}}}{0.6931}$$

The unit is bits (binary digits of information). One nat $\approx$ 1.4427 bits; one bit $\approx$ 0.6931 nats. Information theory traditionally uses bits; machine learning traditionally uses nats.

Definition

Perplexity

The exponential of the cross-entropy:

$$\text{PPL} = e^{H_{\text{nats}}} = 2^{H_{\text{bits}}}$$

Perplexity has an intuitive interpretation: a perplexity of $k$ means the model is "as confused as if it were choosing uniformly among $k$ equally likely tokens at each step."

PPL = 1 means perfect prediction. PPL = $|V|$ means uniform random guessing over the vocabulary. For modern language models on standard benchmarks, PPL is typically 5 to 30.
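A quick numerical check of the "effective vocabulary size" reading (plain Python, no model involved): a uniform distribution over $k$ tokens has cross-entropy $\ln k$ nats, so its perplexity is exactly $k$.

```python
import math

# Cross-entropy of a uniform distribution over k tokens is ln(k) nats,
# so its perplexity is exactly k: the "effective branching factor".
k = 8
h_nats = -sum((1 / k) * math.log(1 / k) for _ in range(k))  # ln(8) ≈ 2.079
print(math.exp(h_nats))  # 8.0
```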

Definition

Bits Per Byte (BPB)

Cross-entropy in bits, normalized by the total number of UTF-8 bytes in the evaluated text rather than by the number of tokens. Let $L_{\text{nats}}$ be the total negative log-likelihood (summed over every token in the corpus, in nats), let $N_{\text{tok}}$ be the total token count, and let $N_{\text{byte}}$ be the total UTF-8 byte count of the underlying text. Then:

$$\text{BPB} = \frac{L_{\text{nats}}}{\ln 2 \cdot N_{\text{byte}}} = \frac{L_{\text{bits}}}{N_{\text{byte}}}$$

Equivalently, if $\bar{H}_{\text{bits}} = L_{\text{bits}} / N_{\text{tok}}$ is the average bits per token and $b = N_{\text{byte}} / N_{\text{tok}}$ is bytes per token for this tokenizer and corpus:

$$\text{BPB} = \frac{\bar{H}_{\text{bits}}}{b}$$

The loss is summed over tokens (not averaged) before dividing by bytes. BPB normalizes by byte count rather than token count, which makes comparisons across tokenizers valid: two models with different vocabularies produce different token counts for the same text, but the byte count of the text is fixed.

Typical values: 0.5 to 1.0 BPB for strong language models on English text. Shannon (1951) estimated English entropy at roughly 0.6 to 1.3 bits per character via human prediction experiments.
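A minimal sketch of the computation, assuming you already have per-token NLLs in nats (e.g. from an evaluation loop) and the original text; the function name is illustrative:

```python
import math

def bits_per_byte(token_nlls_nats: list[float], text: str) -> float:
    """BPB from per-token NLLs (in nats) and the evaluated text.

    Sum the loss over tokens first, convert nats -> bits, then divide
    by the UTF-8 byte count of the underlying text.
    """
    total_nats = sum(token_nlls_nats)        # summed, not averaged
    n_bytes = len(text.encode("utf-8"))      # tokenizer-independent denominator
    return total_nats / (math.log(2) * n_bytes)
```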

Conversion Table

Proposition

Unit Conversion Identities

Statement

| From | To | Formula |
| --- | --- | --- |
| Nats | Bits | $\text{bits} = \text{nats} / \ln 2 \approx \text{nats} \times 1.4427$ |
| Bits | Nats | $\text{nats} = \text{bits} \times \ln 2 \approx \text{bits} \times 0.6931$ |
| Nats | Perplexity | $\text{PPL} = e^{\text{nats}}$ |
| Bits | Perplexity | $\text{PPL} = 2^{\text{bits}}$ |
| Perplexity | Nats | $\text{nats} = \ln(\text{PPL})$ |
| Perplexity | Bits | $\text{bits} = \log_2(\text{PPL})$ |
| Bits (per token) | BPB | $\text{BPB} = \text{bits} \times (\text{tokens} / \text{bytes})$ |
| Total nats loss | BPB | $\text{BPB} = L_{\text{nats}} / (\ln 2 \cdot N_{\text{byte}})$ |
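The identities above as a small set of helper functions (a sketch; plain Python, standard library only):

```python
import math

LN2 = math.log(2)  # ≈ 0.6931

def nats_to_bits(nats: float) -> float:
    return nats / LN2

def bits_to_nats(bits: float) -> float:
    return bits * LN2

def nats_to_ppl(nats: float) -> float:
    return math.exp(nats)

def ppl_to_nats(ppl: float) -> float:
    return math.log(ppl)

def bits_per_token_to_bpb(bits_per_token: float, bytes_per_token: float) -> float:
    return bits_per_token / bytes_per_token
```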

Intuition

Nats and bits are the same quantity on different logarithmic scales ($e$ vs 2). Perplexity is the exponential transformation that converts from a logarithmic scale to a linear "effective vocabulary size" scale. BPB is a normalization that removes the tokenizer from the equation.

Why It Matters

Without these conversions, you cannot compare results across papers. If one paper reports "perplexity 25" and another reports "3.2 nats per token," these are essentially the same result: 3.2 nats = 4.6 bits = PPL 24.5.

Failure Mode

The conversion between bits-per-token and BPB depends on the tokenizer's compression ratio (tokens per byte). This ratio varies across tokenizers: GPT-2's BPE produces roughly 0.25 tokens per byte on English text; SentencePiece may differ. You cannot convert between bits-per-token and BPB without knowing the tokenizer.

Worked Example

A GPT-2 model achieves a cross-entropy of 3.4 nats per token on WikiText-103. GPT-2's tokenizer averages about 3.8 bytes per token on this dataset.

| Metric | Value | Calculation |
| --- | --- | --- |
| Nats per token | 3.4 | (given) |
| Bits per token | 4.90 | $3.4 / 0.6931 = 4.90$ |
| Perplexity | 29.96 | $e^{3.4} = 29.96$ |
| BPB | 1.29 | $4.90 / 3.8 = 1.29$ |

If a different model with a different tokenizer (averaging 4.2 bytes per token) achieves 3.1 nats per token, you cannot directly compare their nats or perplexity (different tokenizations). But you can compare BPB: $3.1 / 0.6931 / 4.2 = 1.06$ BPB. The second model is better (lower BPB = better compression).
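Reproducing the worked example with the helpers from the conversion table above:

```python
# Model 1: GPT-2, 3.4 nats/token, ~3.8 bytes/token on WikiText-103.
bits = nats_to_bits(3.4)                               # 4.905
ppl = nats_to_ppl(3.4)                                 # 29.964
bpb_1 = bits_per_token_to_bpb(bits, 3.8)               # 1.291

# Model 2: different tokenizer, 3.1 nats/token, ~4.2 bytes/token.
bpb_2 = bits_per_token_to_bpb(nats_to_bits(3.1), 4.2)  # 1.065

# Nats and PPL are not comparable across tokenizers, but BPB is:
print(bpb_1, bpb_2)  # model 2 wins on BPB despite a coarser tokenization
```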

Character-Level Benchmarks: enwik8 and text8

Two byte-level / character-level benchmarks report results in BPB (often called BPC, bits-per-character, when the text is pure ASCII where bytes and characters coincide):

  • enwik8: the first 100 MB of an English Wikipedia XML dump (Matt Mahoney's Hutter Prize corpus). Contains raw XML, markup, and multi-byte UTF-8 characters. SOTA results are reported in BPB.
  • text8: 100 MB of Wikipedia text from the same dump, after stripping XML, lowercasing, and keeping only the 26 letters a–z plus space. Pure ASCII, so BPB equals BPC.

Reference results (BPB, lower is better):

| Model | enwik8 | text8 | Source |
| --- | --- | --- | --- |
| Transformer-XL (large) | 0.99 | 1.08 | Dai et al. 2019 |
| Compressive Transformer | 0.97 | 1.05 | Rae et al. 2020 |
| GPT-2 (zero-shot) | 0.93 | — | Radford et al. 2019 |

These benchmarks exist specifically to standardize comparison: because the text is a fixed byte stream, BPB is tokenizer-agnostic and directly interpretable as compression ratio.

When to Use Each Unit

| Unit | Use when | Avoid when |
| --- | --- | --- |
| Nats | Training (it is the raw loss), internal monitoring | Comparing across tokenizers |
| Bits | Information theory context, compression discussion | When your audience expects nats |
| Perplexity | Paper reporting, intuitive communication | Averaging across datasets (arithmetic mean of PPL is misleading; average the log-PPL instead) |
| BPB | Cross-tokenizer comparison, scaling law analysis | When byte-level normalization obscures token-level behavior |

Common Confusions

Watch Out

You cannot average perplexities

If model A has PPL 20 on dataset 1 and PPL 40 on dataset 2, the "average perplexity" is not 30. Perplexity is an exponential quantity. The correct aggregation is: average the log-perplexities (cross-entropies), then exponentiate. The geometric mean of perplexities equals the exponential of the arithmetic mean of the cross-entropies: $\text{PPL}_{\text{avg}} = \exp\left(\tfrac{1}{2}(\ln 20 + \ln 40)\right) = \exp(3.34) \approx 28.3$.
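The same check in a few lines of plain Python:

```python
import math

ppls = [20.0, 40.0]

wrong = sum(ppls) / len(ppls)  # 30.0 -- arithmetic mean, misleading

# Average in log space (i.e., average the cross-entropies), then exponentiate.
mean_log = sum(math.log(p) for p in ppls) / len(ppls)  # 3.342 nats
right = math.exp(mean_log)                             # 28.28, the geometric mean

assert abs(right - math.sqrt(20.0 * 40.0)) < 1e-9
```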

Watch Out

Lower perplexity does not always mean a better model

Perplexity measures how well the model predicts the specific test set. A model with lower perplexity on news articles may have higher perplexity on code. Perplexity is domain-specific. Also, a model can achieve low perplexity by being overconfident on easy tokens and terrible on hard tokens. Calibration is a separate and important property.

Watch Out

BPB depends on the text encoding

BPB normalizes by UTF-8 bytes. If your text is mostly ASCII (1 byte per character), BPB is close to bits-per-character. For text with many non-ASCII characters (Chinese, Arabic, emoji), UTF-8 uses 2 to 4 bytes per character, changing the BPB number without changing model quality. Always note the character set when comparing BPB.
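A quick illustration of how the byte denominator moves with the script (plain Python):

```python
# UTF-8 byte counts per character vary by script, so the BPB denominator
# changes even when the character count does not.
for s in ["hello", "héllo", "你好", "🙂"]:
    print(f"{s!r}: {len(s)} chars, {len(s.encode('utf-8'))} bytes")
# 'hello': 5 chars, 5 bytes
# 'héllo': 5 chars, 6 bytes
# '你好': 2 chars, 6 bytes
# '🙂': 1 char, 4 bytes
```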

Watch Out

Cross-entropy is literally a compression bound

A model with cross-entropy $H$ bits per symbol can, via arithmetic coding, compress the evaluated text to $H$ bits per symbol in expectation (plus a vanishing per-sequence overhead). This is the source coding theorem applied to the model's predictive distribution. BPB is therefore not just a loss number: a model at 0.8 BPB on enwik8 compresses the file to 10 MB ($10^8$ bytes $\times$ 0.8 bits/byte $= 8 \times 10^7$ bits $= 10^7$ bytes), and no coder using that predictive model can do better. This is why Hutter Prize scoring is phrased as compressed-file-size rather than perplexity. See Cover and Thomas (2006) Ch. 5 or MacKay (2003) Ch. 4 to 6.
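The arithmetic as a one-screen sanity check:

```python
# enwik8 is 1e8 bytes. A predictive model at `bpb` bits per byte yields a
# compressed size of (bytes * bpb) bits, i.e. (bytes * bpb / 8) bytes.
enwik8_bytes = 10**8
bpb = 0.8
compressed_bytes = enwik8_bytes * bpb / 8
print(compressed_bytes / 1e6)  # 10.0 (MB)
```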

Watch Out

Averaging perplexity across documents is token-weighted by convention

When a benchmark reports a single perplexity over a corpus of many documents, the standard convention (used by lm-evaluation-harness, HuggingFace evaluate, and most papers) is token-weighted, not document-weighted. Let $L_i$ be the per-token NLL on document $i$ and $n_i$ its token count. Then the reported perplexity is:

$$\text{PPL} = \exp\left(\frac{\sum_i L_i \cdot n_i}{\sum_i n_i}\right)$$

This is equivalent to concatenating all documents and computing a single corpus-level cross-entropy. Document-weighted averaging (giving a 10-token doc the same weight as a 10000-token doc) produces different numbers and is almost never what you want. Always check which convention a paper uses if the corpus contains documents of very unequal length.
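A sketch of the token-weighted aggregation (plain Python; the document-weighted variant is included only to contrast):

```python
import math

def corpus_ppl_token_weighted(per_token_nlls, token_counts):
    """per_token_nlls[i]: mean NLL in nats on doc i; token_counts[i]: its length."""
    total = sum(L * n for L, n in zip(per_token_nlls, token_counts))
    return math.exp(total / sum(token_counts))

def corpus_ppl_document_weighted(per_token_nlls):
    """Gives every document equal weight -- almost never what you want."""
    return math.exp(sum(per_token_nlls) / len(per_token_nlls))

# A short easy doc and a long hard doc give very different answers:
nlls, counts = [1.0, 4.0], [10, 10_000]
print(corpus_ppl_token_weighted(nlls, counts))  # exp(3.997) ≈ 54.4
print(corpus_ppl_document_weighted(nlls))       # exp(2.5)   ≈ 12.2
```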

Exercises

Exercise (Core)

Problem

A language model achieves a cross-entropy of 2.1 nats per token. Convert this to bits per token and perplexity. If the tokenizer averages 4.0 bytes per token, what is the BPB?

Exercise (Core)

Problem

Paper A reports PPL = 15.2 on WikiText-103 using a BPE tokenizer. Paper B reports PPL = 18.7 on WikiText-103 using a unigram tokenizer. Can you conclude that model A is better? What additional information do you need?

References

Canonical:

  • Shannon, C. E., "Prediction and Entropy of Printed English," Bell System Technical Journal 30:50-64 (1951). Introduces bits-per-character (BPC) for English text, estimating entropy at roughly 0.6 to 1.3 bits per character via human prediction. The historical origin of the metric.
  • Cover, T. M. and Thomas, J. A., Elements of Information Theory (2nd ed., 2006), Chapters 2 (entropy, KL) and 5 (source coding theorem, arithmetic coding). The compression interpretation of cross-entropy.
  • MacKay, D. J. C., Information Theory, Inference, and Learning Algorithms (2003), Chapters 4 to 6. Source coding, arithmetic coding, and compression bounds derived from predictive distributions. Free PDF at www.inference.org.uk/mackay/itila.
  • Jelinek, F., Statistical Methods for Speech Recognition (1997). Early use of perplexity in language modeling.
  • Brown, P. F. et al., "An Estimate of an Upper Bound for the Entropy of English," Computational Linguistics 18(1):31-40 (1992). Refines Shannon's entropy estimate using a trigram model on a large corpus.

Benchmarks:

  • Mahoney, M., "Large Text Compression Benchmark" (2006, updated). Defines enwik8 (first $10^8$ bytes of an English Wikipedia XML dump) and text8 (cleaned lowercase ASCII variant). http://mattmahoney.net/dc/textdata.html
  • Dai, Z. et al., "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context," ACL 2019 (arXiv:1901.02860). Reports 0.99 BPB on enwik8.
  • Rae, J. W. et al., "Compressive Transformers for Long-Range Sequence Modelling," ICLR 2020 (arXiv:1911.05507). Reports 0.97 BPB on enwik8.
  • Radford, A. et al., "Language Models are Unsupervised Multitask Learners" (2019). GPT-2 technical report, zero-shot perplexity and BPB on multiple benchmarks.

Current practice:

  • Gao, L. et al., "The Pile: An 800GB Dataset of Diverse Text for Language Modeling," arXiv:2101.00027 (2020). Uses BPB for cross-tokenizer comparison.
  • Hoffmann, J. et al., "Training Compute-Optimal Large Language Models," arXiv:2203.15556 (2022). Chinchilla reports both nats and BPB.
  • EleutherAI, lm-evaluation-harness, github.com/EleutherAI/lm-evaluation-harness. Reference implementation of token-weighted corpus perplexity.

Last reviewed: April 18, 2026
