Efficient numeracy in language models through single-token number embeddings
- URL: http://arxiv.org/abs/2510.06824v1
- Date: Wed, 08 Oct 2025 09:48:11 GMT
- Title: Efficient numeracy in language models through single-token number embeddings
- Authors: Linus Kreitner, Paul Hager, Jonathan Mengedoht, Georgios Kaissis, Daniel Rueckert, Martin J. Menten,
- Abstract summary: We show that frontier large language models (LLMs) require excessive amounts of reasoning tokens to solve even basic calculations. We propose BitTokens, a novel tokenization strategy that embeds any number into a single token using its IEEE 754 binary floating-point representation.
- Score: 29.22332468518668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To drive progress in science and engineering, large language models (LLMs) must be able to process large amounts of numerical data and solve long calculations efficiently. This is currently only possible through the use of external tools or extensive reasoning chains, either limiting the numerical intuition of LLMs or limiting the length of problems they can solve. We show that frontier LLMs require excessive amounts of reasoning tokens to solve even basic calculations, which is exacerbated by their tokenization strategies that split single numbers into multiple tokens. This motivates the need for efficient and effective single-token number encodings. We introduce a set of desiderata for such encodings and show that existing approaches fail to fulfill them. To address these shortcomings, we propose BitTokens, a novel tokenization strategy that embeds any number into a single token using its IEEE 754 binary floating-point representation. Through extensive experiments we show that our BitTokens allow even small language models to learn algorithms that solve basic arithmetic operations nearly perfectly. This newly gained efficiency could expand the length and complexity of problems language models can solve.
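To make the core idea concrete, here is a minimal sketch of a BitTokens-style encoding: a number is mapped losslessly to the 64 bits of its IEEE 754 double-precision representation, and those bits are exposed as a fixed-size vector that could seed a single token's embedding. The function names, the {-1, +1} bit convention, and the absence of any learned projection are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a BitTokens-style number encoding (illustrative, not the
# paper's implementation): pack a float into its 64 IEEE 754 bits and expose
# them as a fixed-size vector usable as the seed of a single-token embedding.
import struct

import numpy as np


def float_to_bits(x: float) -> np.ndarray:
    """Return the 64 IEEE 754 bits of x (sign, exponent, mantissa) as a {-1, +1} vector."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = np.array([(as_int >> i) & 1 for i in reversed(range(64))], dtype=np.float32)
    return 2.0 * bits - 1.0  # map {0, 1} -> {-1, +1} for a zero-centred embedding


def bits_to_float(bits: np.ndarray) -> float:
    """Invert float_to_bits: recover the number exactly from its bit vector."""
    as_int = 0
    for b in (bits > 0).astype(int):
        as_int = (as_int << 1) | int(b)
    (x,) = struct.unpack(">d", struct.pack(">Q", as_int))
    return x


if __name__ == "__main__":
    for value in [3.14159, -2.5e8, 1e-12]:
        vec = float_to_bits(value)          # 64-dim vector for one "number token"
        assert bits_to_float(vec) == value  # the encoding is exactly invertible
    print("round-trip OK")
```

Because the bit mapping is exactly invertible, no numerical precision is lost when a number is packed into a single token, regardless of its magnitude or number of digits.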
Related papers
- Token Sugar: Making Source Code Sweeter for LLMs through Token-Efficient Shorthand [12.853934439806908]
We propose Token Sugar, a concept that replaces frequent and verbose code patterns with reversible, token-efficient shorthand in the source code. With this solution, we obtain 799 (code pattern, shorthand) pairs, which reduce the token count of source code by up to 15.1%. Experimental results show that models trained on the sugared code not only achieve significant token savings (up to 11.2% reduction) during generation but also maintain near-identical Pass@1 scores compared to baselines trained on unprocessed code.
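As a rough illustration of the shorthand idea, the sketch below substitutes verbose patterns with short placeholders and reverses the substitution afterwards; the pattern table and placeholder strings are invented for illustration and are not the paper's 799 mined pairs.

```python
# Toy sketch of reversible, token-efficient shorthand (illustrative only):
# verbose code patterns are swapped for short placeholders before tokenization
# and swapped back afterwards.
SUGAR = {
    'if __name__ == "__main__":': "⟦MAIN⟧",
    "for i in range(len(": "⟦FORLEN⟧",
    "raise NotImplementedError": "⟦TODO⟧",
}


def sweeten(source: str) -> str:
    """Replace verbose patterns with token-efficient shorthand."""
    for pattern, shorthand in SUGAR.items():
        source = source.replace(pattern, shorthand)
    return source


def unsweeten(source: str) -> str:
    """Reverse the substitution (assumes the placeholders never occur naturally)."""
    for pattern, shorthand in SUGAR.items():
        source = source.replace(shorthand, pattern)
    return source


code = 'if __name__ == "__main__":\n    raise NotImplementedError\n'
assert unsweeten(sweeten(code)) == code  # the rewrite is lossless
```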
arXiv Detail & Related papers (2025-12-09T05:42:23Z) - Sampling from Your Language Model One Byte at a Time [82.71473348639489]
Tokenization can introduce distortion into the model's generations, known as the Prompt Boundary Problem (PBP). We present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers.
arXiv Detail & Related papers (2025-06-17T02:37:04Z) - Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating the constraint on every token can be prohibitively expensive. Locally constrained decoding (LCD) can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - FoNE: Precise Single-Token Number Embeddings via Fourier Features [51.17846016593835]
We propose a novel method that maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation. On 6-digit decimal addition, FoNE requires 64$\times$ less data to achieve 99% accuracy than subword and digit-wise embeddings. FoNE is the only method that yields 100% accuracy on over 100,000 test examples for addition, subtraction, and multiplication.
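A rough sketch of the Fourier-feature idea follows: each represented digit position contributes one (cos, sin) pair, so a number occupies a single token whose embedding grows by only two dimensions per digit. The choice of periods, the fixed digit budget, and the function name are simplifying assumptions for illustration; the paper's training and decoding procedure are omitted.

```python
# Illustrative sketch of a Fourier-feature number embedding: one (cos, sin)
# pair per digit position, i.e. two embedding dimensions per digit.
import numpy as np


def fourier_number_embedding(x: float, num_int_digits: int = 6, num_frac_digits: int = 2) -> np.ndarray:
    """Embed x as cos/sin pairs at one period per represented digit position."""
    periods = [10.0 ** k for k in range(1, num_int_digits + 1)]          # 10, 100, ..., 10^6
    periods += [10.0 ** (1 - k) for k in range(1, num_frac_digits + 1)]  # 1, 0.1
    feats = []
    for period in periods:
        phase = 2.0 * np.pi * x / period
        feats.extend([np.cos(phase), np.sin(phase)])
    return np.asarray(feats, dtype=np.float32)


emb = fourier_number_embedding(123456.78)
print(emb.shape)  # (16,): 8 digit positions * 2 dimensions each
```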
arXiv Detail & Related papers (2025-02-13T19:54:59Z) - Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [53.57895922042783]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data. We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z) - Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
Next-token distribution outputs offer a theoretically appealing approach for detecting text generated by large language models (LLMs). We propose the Perplexity Attention Weighted Network (PAWN), which uses the last hidden states of the LLM and positions to weight the sum of a series of features based on metrics from the next-token distribution across the sequence length. PAWN shows competitive and even better in-distribution performance than the strongest baselines with a fraction of their trainable parameters.
arXiv Detail & Related papers (2025-01-07T17:00:49Z) - Regress, Don't Guess -- A Regression-like Loss on Number Tokens for Language Models [2.5464748274973026]
We present a regression-like loss that operates purely on the token level. Our proposed Number Token Loss (NTL) comes in two flavors and minimizes either the $L_p$ norm or the Wasserstein distance. We evaluate the proposed scheme on various mathematical datasets and find that it consistently improves performance in math-related tasks.
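A hedged sketch of one possible token-level number loss in this spirit is shown below, restricted to a digit-only vocabulary (tokens 0-9) and using a squared-error flavor; these restrictions and all names are simplifying assumptions, and the paper also describes a Wasserstein-distance variant.

```python
# Sketch of a regression-like number-token loss (illustrative): penalise how far
# the expected numeric value under the model's distribution over digit tokens
# lies from the true digit, as an add-on to the usual cross-entropy.
import torch
import torch.nn.functional as F


def number_token_loss(digit_logits: torch.Tensor, target_digits: torch.Tensor) -> torch.Tensor:
    """digit_logits: (batch, 10) scores over digit tokens; target_digits: (batch,) values in 0..9."""
    probs = F.softmax(digit_logits, dim=-1)                        # (batch, 10)
    digit_values = torch.arange(10, dtype=probs.dtype)             # numeric value of each token
    expected_value = (probs * digit_values).sum(dim=-1)            # (batch,)
    return ((expected_value - target_digits.float()) ** 2).mean()  # squared-error flavour


logits = torch.randn(4, 10, requires_grad=True)
targets = torch.tensor([3, 7, 0, 9])
loss = number_token_loss(logits, targets)
loss.backward()  # differentiable, so it can be added to the standard cross-entropy loss
```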
arXiv Detail & Related papers (2024-11-04T13:43:24Z) - ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens. During inference, ElasticTok can dynamically allocate tokens when needed. Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z) - Let's Think Dot by Dot: Hidden Computation in Transformer Language Models [30.972412126012884]
Chain-of-thought responses from language models improve performance across most benchmarks.
We show that transformers can use meaningless filler tokens in place of a chain of thought to solve two hard algorithmic tasks.
We find that learning to use filler tokens is difficult and requires specific, dense supervision to converge.
arXiv Detail & Related papers (2024-04-24T09:30:00Z) - Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs [3.6722413665749674]
Tokenization is the division of input text into tokens.
We study the effect this choice has on numerical reasoning through the use of arithmetic tasks.
arXiv Detail & Related papers (2024-02-22T18:14:09Z) - Identifying and Analyzing Performance-Critical Tokens in Large Language Models [52.404072802235234]
We study how large language models learn to perform tasks from demonstrations. Our work sheds light on this process and deepens our understanding of the roles that different types of tokens play in large language models.
arXiv Detail & Related papers (2024-01-20T20:55:21Z)