Tokenization counts: the impact of tokenization on arithmetic in
frontier LLMs
- URL: http://arxiv.org/abs/2402.14903v1
- Date: Thu, 22 Feb 2024 18:14:09 GMT
- Title: Tokenization counts: the impact of tokenization on arithmetic in
frontier LLMs
- Authors: Aaditya K. Singh, DJ Strouse
- Abstract summary: Tokenization is the division of input text into input tokens.
We study the effect this choice has on numerical reasoning through the use of arithmetic tasks.
- Score: 3.6722413665749674
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenization, the division of input text into input tokens, is an often
overlooked aspect of the large language model (LLM) pipeline and could be the
source of useful or harmful inductive biases. Historically, LLMs have relied on
byte pair encoding, without care to specific input domains. With the increased
use of LLMs for reasoning, various number-specific tokenization schemes have
been adopted, with popular models like LLaMa and PaLM opting for single-digit
tokenization while GPT-3.5 and GPT-4 have separate tokens for each 1-, 2-, and
3-digit numbers. In this work, we study the effect this choice has on numerical
reasoning through the use of arithmetic tasks. We consider left-to-right and
right-to-left tokenization for GPT-3.5 and -4, finding that right-to-left
tokenization (enforced by comma separating numbers at inference time) leads to
largely improved performance. Furthermore, we find that model errors when using
standard left-to-right tokenization follow stereotyped error patterns,
suggesting that model computations are systematic rather than approximate. We
show that the model is able to convert between tokenizations easily, thus
allowing chain-of-thought-inspired approaches to recover performance on
left-to-right tokenized inputs. We also find the gap between tokenization
directions decreases when models are scaled, possibly indicating that larger
models are better able to override this tokenization-dependent inductive bias.
In summary, our work performs the first study of how number tokenization
choices lead to differences in model performance on arithmetic tasks,
accompanied by a thorough analysis of error patterns. We hope this work
inspires practitioners to more carefully ablate number tokenization-related
choices when working towards general models of numerical reasoning.
Related papers
- Demystifying Singular Defects in Large Language Models [61.98878352956125]
In large language models (LLMs), the underlying causes of high-norm tokens remain largely unexplored.
We provide both theoretical insights and empirical validation across a range of recent models.
We showcase two practical applications of these findings: the improvement of quantization schemes and the design of LLM signatures.
arXiv Detail & Related papers (2025-02-10T20:09:16Z) - Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [44.84219266082269]
Large Language Models (LLMs) excel at reasoning and planning when trained on chainof-thought (CoT) data.
We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z) - Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
Next-token distribution outputs offer a theoretically appealing approach for detection of large language models (LLMs)
We propose the Perplexity Attention Weighted Network (PAWN), which uses the last hidden states of the LLM and positions to weight the sum of a series of features based on metrics from the next-token distribution across the sequence length.
PAWN shows competitive and even better performance in-distribution than the strongest baselines with a fraction of their trainable parameters.
arXiv Detail & Related papers (2025-01-07T17:00:49Z) - Enhancing Item Tokenization for Generative Recommendation through Self-Improvement [67.94240423434944]
Generative recommendation systems are driven by large language models (LLMs)
Current item tokenization methods include using text descriptions, numerical strings, or sequences of discrete tokens.
We propose a self-improving item tokenization method that allows the LLM to refine its own item tokenizations during training process.
arXiv Detail & Related papers (2024-12-22T21:56:15Z) - Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes.
We present a novel framework for identifying these tokens through rollout sampling.
We show that identifying and replacing critical tokens significantly improves model accuracy.
arXiv Detail & Related papers (2024-11-29T18:58:22Z) - Regress, Don't Guess -- A Regression-like Loss on Number Tokens for Language Models [2.5346260093097017]
We present two versions of a number token loss for language models.
The first is based on an $L_p$ loss between the ground truth token value and the weighted sum of the predicted class probabilities.
The second loss minimizes the Wasserstein-1 distance between the distribution of the predicted output probabilities and the ground truth distribution.
arXiv Detail & Related papers (2024-11-04T13:43:24Z) - Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LM)
This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts.
We develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or optimization.
arXiv Detail & Related papers (2024-10-11T23:30:42Z) - Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE)
RISE injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation.
Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
arXiv Detail & Related papers (2024-10-09T07:43:38Z) - Understanding and Mitigating Tokenization Bias in Language Models [6.418593476658017]
State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
arXiv Detail & Related papers (2024-06-24T17:38:02Z) - You should evaluate your language model on marginal likelihood
overtokenisations [5.824498637088864]
We argue that language models should be evaluated on their marginal likelihood over tokenisations.
We evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities.
arXiv Detail & Related papers (2021-09-06T15:37:02Z) - ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
arXiv Detail & Related papers (2020-03-23T21:17:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.