Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning
- URL: http://arxiv.org/abs/2512.23087v1
- Date: Sun, 28 Dec 2025 21:44:07 GMT
- Title: Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning
- Authors: Yingru Li, Jiawei Xu, Jiacai Liu, Yuxuan Tong, Ziniu Li, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang,
- Abstract summary: We show that inference engines and numerically-precise training systems produce different probability distributions from the same parameters, creating a training-inference mismatch. By pruning extreme-tail tokens, we trade large, systematically biased mismatches for a small, bounded optimization bias.
- Score: 35.41241409574854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning for large language models (LLMs) faces a fundamental tension: high-throughput inference engines and numerically-precise training systems produce different probability distributions from the same parameters, creating a training-inference mismatch. We prove this mismatch has an asymmetric effect: the bound on log-probability mismatch scales as $(1-p)$ where $p$ is the token probability. For high-probability tokens, this bound vanishes, contributing negligibly to sequence-level mismatch. For low-probability tokens in the tail, the bound remains large, and moreover, when sampled, these tokens exhibit systematically biased mismatches that accumulate over sequences, destabilizing gradient estimation. Rather than applying post-hoc corrections, we propose constraining the RL objective to a dynamically-pruned "safe" vocabulary that excludes the extreme tail. By pruning such tokens, we trade large, systematically biased mismatches for a small, bounded optimization bias. Empirically, our method achieves stable training; theoretically, we bound the optimization bias introduced by vocabulary pruning.
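The abstract's pruning idea can be sketched concretely. Below is a minimal, hypothetical illustration (not the authors' implementation): compute the next-token distribution, drop tokens whose probability falls below a threshold `eps` (an assumed hyperparameter, not a value from the paper), and renormalize so the RL objective only ever scores tokens from the pruned "safe" vocabulary.

```python
import numpy as np

def prune_to_safe_vocab(logits, eps=1e-3):
    """Restrict a next-token distribution to a 'safe' vocabulary by
    dropping extreme-tail tokens (p < eps) and renormalizing.
    `eps` is an assumed threshold, not a value from the paper."""
    p = np.exp(logits - logits.max())       # stable softmax
    p /= p.sum()
    safe = p >= eps                         # dynamically-selected safe set
    pruned = np.where(safe, p, 0.0)
    pruned /= pruned.sum()                  # renormalize over safe tokens
    return pruned, safe

def sequence_logprob(pruned_rows, token_ids):
    """Sequence log-probability under the pruned per-step distributions;
    in the paper's framing, tail tokens are simply excluded from the
    RL objective rather than corrected post hoc."""
    return sum(np.log(row[t]) for row, t in zip(pruned_rows, token_ids))

# Example: a tiny 6-token vocabulary with a heavy head and a thin tail.
logits = np.array([5.0, 4.0, 1.0, -3.0, -9.0, -12.0])
pruned, safe = prune_to_safe_vocab(logits, eps=1e-3)
lp = sequence_logprob([pruned], [0])        # log-prob of the head token
```

The trade-off mirrors the abstract: the three tail tokens (whose $(1-p)$ mismatch bound stays near 1) are removed, at the cost of a small, bounded bias from renormalizing the surviving head mass.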
Related papers
- Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning [17.384089089363382]
We identify a root cause that existing methods overlook: the uniform penalization of errors. Current approaches treat all incorrect rollouts within a group identically. We propose the Asymmetric Confidence-aware Error Penalty (ACE).
arXiv Detail & Related papers (2026-02-24T22:46:43Z) - Probability-Entropy Calibration: An Elastic Indicator for Adaptive Fine-tuning [55.2818264614932]
RankTuner introduces a probability-entropy calibration signal, the Relative Rank Indicator, which compares the rank of the ground-truth token with its expected rank under the prediction distribution. The inverse indicator is used as a token-wise Relative Scale to reweight the fine-tuning objective, focusing updates on truly under-learned tokens.
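The rank-based reweighting described above might look roughly like the sketch below; the exact definitions of the indicator and of the expected rank here are assumptions for illustration, not the paper's formulas.

```python
import numpy as np

def relative_rank_weight(probs, gold_id):
    """Hypothetical sketch in the spirit of the Relative Rank Indicator:
    compare the rank of the ground-truth token with the expected rank
    under the prediction distribution, and use the ratio as a token-wise
    loss weight (>1 suggests an under-learned token). These definitions
    are assumptions, not the paper's."""
    order = np.argsort(-probs)                   # token ids, best first
    rank = int(np.where(order == gold_id)[0][0]) + 1   # 1-based gold rank
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(probs) + 1)  # rank of each token id
    expected_rank = float(probs @ ranks)         # E[rank] under the model
    return rank / expected_rank

probs = np.array([0.5, 0.3, 0.15, 0.05])
w = relative_rank_weight(probs, gold_id=2)       # gold token is ranked 3rd
```

A well-predicted gold token (rank below the expected rank) gets a weight under 1, steering updates toward the genuinely under-learned tokens.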
arXiv Detail & Related papers (2026-02-02T07:27:19Z) - Learning from N-Tuple Data with M Positive Instances: Unbiased Risk Estimation and Theoretical Guarantees [33.15955234458642]
Weakly supervised learning often operates with coarse aggregate signals rather than labels. We show that counts admit a trainable unbiased risk estimator (URE) by linking the instance-generation process to latent marginals. We demonstrate that count-only supervision can be exploited effectively through a theoretically grounded and practically stable objective setting.
arXiv Detail & Related papers (2025-10-21T08:28:07Z) - BAPE: Learning an Explicit Bayes Classifier for Long-tailed Visual Recognition [78.70453964041718]
Current deep learning algorithms usually solve for the optimal classifier by implicitly estimating the posterior probabilities. This simple methodology has been proven effective for meticulously balanced academic benchmark datasets. However, it is not applicable to the long-tailed data distributions in the real world. This paper presents a novel approach (BAPE) that provides a more precise theoretical estimation of the data distributions.
arXiv Detail & Related papers (2025-06-29T15:12:50Z) - Distributional Properties of Subword Regularization [25.824110425757198]
BPE and MaxMatch, two popular subword tokenization schemes, have dropout regularization variants.
We show that these variants are heavily biased towards a small set of tokenizations per word.
We propose an algorithm to sample tokenizations uniformly, which we use as a drop-in replacement for the stochastic aspects of existing tokenizers.
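Uniform sampling over all segmentations of a word can be done with a standard counting-then-sampling dynamic program; the sketch below is an illustrative stand-in, not the paper's algorithm.

```python
import random

def uniform_tokenization_sampler(word, vocab, rng):
    """Sample a segmentation of `word` into subwords from `vocab`
    uniformly at random. counts[i] = number of segmentations of the
    suffix word[i:]; a backward walk then picks each boundary with
    probability proportional to the segmentations it leads to."""
    n = len(word)
    counts = [0] * (n + 1)
    counts[n] = 1                           # empty suffix: one segmentation
    for i in range(n - 1, -1, -1):
        counts[i] = sum(counts[j] for j in range(i + 1, n + 1)
                        if word[i:j] in vocab)
    if counts[0] == 0:
        raise ValueError("word cannot be segmented with this vocabulary")
    tokens, i = [], 0
    while i < n:
        # choose the next boundary j with probability counts[j] / counts[i]
        r = rng.randrange(counts[i])
        for j in range(i + 1, n + 1):
            if word[i:j] in vocab:
                if r < counts[j]:
                    tokens.append(word[i:j])
                    i = j
                    break
                r -= counts[j]
    return tokens

# "abc" with this vocabulary has 3 segmentations: a|b|c, ab|c, a|bc.
toks = uniform_tokenization_sampler("abc", {"a", "b", "c", "ab", "bc"},
                                    random.Random(0))
```

Because the boundary at each position is chosen in proportion to the number of completions it admits, every full segmentation is drawn with equal probability, unlike dropout-based variants that concentrate on a few tokenizations per word.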
arXiv Detail & Related papers (2024-08-21T08:53:35Z) - Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach [102.0769560460338]
We develop a simple logits retargeting approach (LORT) without the requirement of prior knowledge of the number of samples per class.
Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z) - Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z) - Distributionally Robust Optimization with Bias and Variance Reduction [9.341215359733601]
We show that Prospect, a gradient-based algorithm, enjoys linear convergence for smooth regularized losses.
We also show that Prospect can converge 2-3x faster than baselines such as gradient-based methods.
arXiv Detail & Related papers (2023-10-21T00:03:54Z) - A Heavy-Tailed Algebra for Probabilistic Programming [53.32246823168763]
We propose a systematic approach for analyzing the tails of random variables.
We show how this approach can be used during the static analysis (before drawing samples) pass of a probabilistic programming language compiler.
Our empirical results confirm that inference algorithms that leverage our heavy-tailed algebra attain superior performance across a number of density modeling and variational inference tasks.
arXiv Detail & Related papers (2023-06-15T16:37:36Z) - Statistical Efficiency of Score Matching: The View from Isoperimetry [96.65637602827942]
We show a tight connection between statistical efficiency of score matching and the isoperimetric properties of the distribution being estimated.
We formalize these results in both the infinite-sample and finite-sample regimes.
arXiv Detail & Related papers (2022-10-03T06:09:01Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - Two-stage Training for Learning from Label Proportions [18.78148397471913]
Learning from label proportions (LLP) aims at learning an instance-level classifier with label proportions in grouped training data.
We introduce the mixup strategy and symmetric crossentropy to further reduce the label noise.
Our framework is model-agnostic, and demonstrates compelling performance improvement in extensive experiments.
arXiv Detail & Related papers (2021-05-22T03:55:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.