You Can Learn Tokenization End-to-End with Reinforcement Learning
- URL: http://arxiv.org/abs/2602.13940v1
- Date: Sun, 15 Feb 2026 00:31:24 GMT
- Title: You Can Learn Tokenization End-to-End with Reinforcement Learning
- Authors: Sam Dauncey, Roger Wattenhofer,
- Abstract summary: Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs)<n>We show that these token boundaries can instead be learned using score function estimates, which have tighter theoretical guarantees.<n>We demonstrate that the resultant method outperforms prior proposed straight-through estimates, both qualitatively and quantitatively.
- Score: 34.662213518530315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards architectures becoming increasingly end-to-end. Prior work has shown promising results at scale in bringing this compression step inside the LLMs' architecture with heuristics to draw token boundaries, and also attempts to learn these token boundaries with straight-through estimates, which treat the problem of drawing discrete token boundaries as a continuous one. We show that these token boundaries can instead be learned using score function estimates, which have tighter theoretical guarantees due to directly optimizing the problem of drawing discrete token boundaries to minimize loss. We observe that techniques from reinforcement learning, such as time discounting, are necessary to reduce the variance of this score function sufficiently to make it practicable. We demonstrate that the resultant method outperforms prior proposed straight-through estimates, both qualitatively and quantitatively at the $100$ million parameter scale.
Related papers
- Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning [62.680551162054975]
We introduce an end-to-end framework where LLMs learn to self-regulate the granularity of the reasoning steps through dynamic summarization.<n>We apply reinforcement learning to incentivize this capability further, uncovering a critical insight: the accuracy gap between the highly efficient Fold mode and the exhaustive Unfold mode progressively narrows.<n>Our Accordion-Thinker demonstrates that with learned self-compression, LLMs can tackle complex reasoning tasks with minimal dependency token overhead.
arXiv Detail & Related papers (2026-02-03T08:34:20Z) - SOMBRERO: Measuring and Steering Boundary Placement in End-to-End Hierarchical Sequence Models [10.547898683606569]
We introduce a router-agnostic metric of boundary quality, boundary enrichment B, which measures how strongly chunk starts concentrate on positions with high next-byte surprisal.<n>We propose Sombrero, which steers learning toward predictive difficulty via a confidence-alignment boundary loss and stabilizes learning by applying confidence-off and accuracy-weighted trade smoothing.
arXiv Detail & Related papers (2026-01-30T10:34:07Z) - Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models [16.540220733551823]
Large Vision-Language Models (VLMs) enable strong multimodal reasoning but incur heavy inference costs from redundant visual tokens.<n> Attention-based methods rely on raw attention scores, which are often unstable across layers and heads.<n>We propose ours, a training-free framework built on a simple intuition.
arXiv Detail & Related papers (2025-09-29T14:20:05Z) - Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling.<n>We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z) - Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
evaluating constraint on every token can be prohibitively expensive.<n> LCD can distort the global distribution over strings, sampling tokens based only on local information.<n>We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - Near-Optimal Solutions of Constrained Learning Problems [85.48853063302764]
In machine learning systems, the need to curtail their behavior has become increasingly apparent.
This is evidenced by recent advancements towards developing models that satisfy dual robustness variables.
Our results show that rich parametrizations effectively mitigate non-dimensional, finite learning problems.
arXiv Detail & Related papers (2024-03-18T14:55:45Z) - Primal Dual Continual Learning: Balancing Stability and Plasticity through Adaptive Memory Allocation [86.8475564814154]
We show that it is both possible and beneficial to undertake the constrained optimization problem directly.
We focus on memory-based methods, where a small subset of samples from previous tasks can be stored in a replay buffer.
We show that dual variables indicate the sensitivity of the optimal value of the continual learning problem with respect to constraint perturbations.
arXiv Detail & Related papers (2023-09-29T21:23:27Z) - Certified Robustness via Dynamic Margin Maximization and Improved Lipschitz Regularization [40.2635560771494]
We develop a robust training algorithm to increase the margin in the output (logit) space while regularizing the Lipschitz constant of the model along vulnerable directions.<n>The relative accuracy of the bounds prevents excessive regularization and allows for more direct manipulation of the decision boundary.<n> Experiments on the MNIST, CIFAR-10, and Tiny-ImageNet data sets verify that our proposed algorithm obtains competitively improved results compared to the state-of-the-art.
arXiv Detail & Related papers (2023-09-29T20:07:02Z) - Uncertainty quantification for learned ISTA [5.706217259840463]
Algorithm unrolling schemes stand out among these model-based learning techniques.
They lack certainty estimates and a theory for uncertainty quantification is still elusive.
This work proposes a rigorous way to obtain confidence intervals for the LISTA estimator.
arXiv Detail & Related papers (2023-09-14T18:39:07Z) - Scalable Bayesian Meta-Learning through Generalized Implicit Gradients [64.21628447579772]
Implicit Bayesian meta-learning (iBaML) method broadens the scope of learnable priors, but also quantifies the associated uncertainty.
Analytical error bounds are established to demonstrate the precision and efficiency of the generalized implicit gradient over the explicit one.
arXiv Detail & Related papers (2023-03-31T02:10:30Z) - Exact Asymptotics for Linear Quadratic Adaptive Control [6.287145010885044]
We study the simplest non-bandit reinforcement learning problem: linear quadratic control (LQAC)
We derive expressions for the regret, estimation error, and prediction error of a stepwise-updating LQAC algorithm.
In simulations on both stable and unstable systems, we find that our theory also describes the algorithm's finite-sample behavior remarkably well.
arXiv Detail & Related papers (2020-11-02T22:43:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.