FuguReport

Compute Optimal Tokenization

Authors Tomasz Limisiewicz, Artidoro Pagnoni, Srini Iyer, Mike Lewis, Sachin Mehta, Alisa Liu, Margaret Li, Gargi Ghosh, Luke Zettlemoyer
Affiliations Meta / University of Washington
Categories Method / Tokenization / Optimal tokenization for model scaling, Evaluation / Model Scaling / Scaling behavior relative to data size, Task / Data Compression / Compression speed and information granularity
License CC BY 4.0

Abstract Overview

This paper investigates how the token compression rate (average bytes per token) affects compute-optimal scaling behavior in language models. The authors train 988 latent-tokenized Byte Latent Transformer (BLT) models ranging from 50M to 7B parameters across compute budgets from 5×10^18 to 2×10^21 FLOPs, and fit compression-aware scaling laws for optimal data size, model size, and loss. The key finding is that in compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes rather than tokens, so the byte-to-parameter ratio remains approximately constant across compression rates. The paper also examines subword tokenizers and multilingual settings, finding that scaling behavior is similar across tokenization families and that the optimal compression rate depends on the language.
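
To make the central quantity concrete, the sketch below computes a compression rate as UTF-8 bytes divided by token count, and converts a tokens-per-parameter ratio into bytes per parameter. The whitespace splitter is only a stand-in tokenizer, and the function names are hypothetical; neither comes from the paper.

```python
def compression_rate(text: str, tokens: list[str]) -> float:
    """Average number of UTF-8 bytes per token for one text."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(text.encode("utf-8")) / len(tokens)


def bytes_per_parameter(tokens_per_parameter: float, rate: float) -> float:
    """Convert a token-based data/model ratio into a byte-based one."""
    return tokens_per_parameter * rate


# Stand-in tokenizer (assumption): whitespace splitting, not BLT or BPE.
text = "compute optimal tokenization"
tokens = text.split()
rate = compression_rate(text, tokens)  # 28 bytes spread over 3 tokens
print(f"compression rate: {rate:.2f} bytes/token")
```

The same token budget therefore corresponds to different amounts of raw data depending on the tokenizer, which is why the summary above expresses the optimal ratio in bytes.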

Novelty

The work introduces and empirically fits scaling laws that explicitly account for token compression rate, rather than treating token count as the fundamental data unit. It provides a controlled study across both latent and subword tokenization using 988+320 models, plus multilingual experiments linking optimal compression rate to cross-lingual parity.

Results

For English BLT models, the fitted scaling exponents (α=0.465, β=0.471, both near 0.5) indicate that the optimal bytes-per-parameter ratio remains approximately constant (~60 bytes per parameter) across compute budgets and compression rates. Performance is non-monotonic in the compression rate, so there is an interior optimum T*, and this optimum decreases slowly with scale (T*≈3.69 at 10^20 FLOPs, T*≈3.33 at 2×10^21 FLOPs). Subword models show similar trends. In multilingual experiments, both the optimal bytes-per-parameter ratio and the optimal compression rate vary by language and correlate with cross-lingual parity, while popular multilingual tokenizers are found to over-compress some high-resource languages and under-compress some lower-resource ones.
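
One way to read "both near 0.5" is sketched below. It assumes, purely for illustration, that α and β are the exponents in N_opt ∝ C^α (parameters) and D_opt ∝ C^β (bytes); this summary does not state the exact functional form, so that mapping is an assumption. Under it, the bytes-per-parameter ratio scales as C^(β−α), and the code evaluates how far that ratio would drift across the studied compute range.

```python
ALPHA = 0.465  # reported exponent; ASSUMED here to govern optimal model size
BETA = 0.471   # reported exponent; ASSUMED here to govern optimal data size in bytes

C_MIN = 5e18   # smallest compute budget in the study (FLOPs)
C_MAX = 2e21   # largest compute budget in the study (FLOPs)


def ratio_drift(c_low: float, c_high: float, alpha: float, beta: float) -> float:
    """Factor by which D_opt / N_opt changes between two compute budgets,
    assuming N_opt ~ C**alpha and D_opt ~ C**beta."""
    return (c_high / c_low) ** (beta - alpha)


drift = ratio_drift(C_MIN, C_MAX, ALPHA, BETA)
print(f"bytes-per-parameter ratio changes by a factor of {drift:.3f}")
```

Under this assumed form, the ratio moves by only a few percent over a 400-fold increase in compute, which is consistent with describing it as approximately constant.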

Key Points

  1. Compute-optimal scaling is better expressed in bytes per parameter than in tokens per parameter, because the fitted scaling exponents (α≈0.465, β≈0.471) indicate that the optimal byte-to-parameter ratio (~60 for English) stays nearly constant across compression settings.
  2. There is an interior optimal compression rate rather than a monotonic preference for more compression, and this optimal rate decreases modestly as training compute increases (e.g., from T*≈3.69 at 10^20 FLOPs to T*≈3.33 at 2×10^21 FLOPs).
  3. The same qualitative trends hold for both latent and subword tokenization, while multilingual results show that optimal compression rate is language-dependent, correlates with parity, and diverges from the compression rates achieved by popular BPE tokenizers.
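
The byte-based and token-based views can be related with simple arithmetic. The sketch below takes the reported ~60 bytes per parameter and the two reported optimal compression rates, then derives the implied tokens per parameter and the change in T* per tenfold increase in compute. The helper names are hypothetical, and the per-decade slope simply connects the two reported points; it is not a fitted law from the paper.

```python
import math

BYTES_PER_PARAM = 60.0             # approximate English value reported above
T_STAR = {1e20: 3.69, 2e21: 3.33}  # compute (FLOPs) -> reported optimal bytes per token


def tokens_per_parameter(bytes_per_param: float, rate: float) -> float:
    """Tokens per parameter implied by a byte ratio and a compression rate."""
    return bytes_per_param / rate


def slope_per_decade(points: dict[float, float]) -> float:
    """Change in T* per 10x compute, from a straight line in log10(compute)
    through exactly two points (illustrative assumption)."""
    (c0, t0), (c1, t1) = sorted(points.items())
    return (t1 - t0) / (math.log10(c1) - math.log10(c0))


for compute, rate in sorted(T_STAR.items()):
    ratio = tokens_per_parameter(BYTES_PER_PARAM, rate)
    print(f"{compute:.0e} FLOPs: about {ratio:.1f} tokens per parameter")
print(f"T* changes by {slope_per_decade(T_STAR):.2f} bytes/token per decade of compute")
```

With these inputs, the implied ratio is roughly 16 to 18 tokens per parameter, and T* falls by a little under 0.3 bytes per token for each tenfold increase in compute.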

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.