Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity
- URL: http://arxiv.org/abs/2512.05962v1
- Date: Fri, 05 Dec 2025 18:56:40 GMT
- Title: Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity
- Authors: Germán Kruszewski, Pierre Erbacher, Jos Rozen, Marc Dymetman
- Abstract summary: Reinforcement Learning (RL) has become the de facto standard for tuning LLMs to solve tasks involving reasoning. We argue that RL implicitly optimizes the "mode-seeking" or "zero-forcing" Reverse KL to a target distribution, causing the model to concentrate mass on certain high-probability regions of the target while neglecting others. In this work, we instead begin from an explicit target distribution, obtained by filtering out incorrect answers while preserving the relative probabilities of correct ones.
- Score: 13.211627219720796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning (RL) has become the de facto standard for tuning LLMs to solve tasks involving reasoning. However, growing evidence shows that models trained in this way often suffer from a significant loss in diversity. We argue that this arises because RL implicitly optimizes the "mode-seeking" or "zero-forcing" Reverse KL to a target distribution, causing the model to concentrate mass on certain high-probability regions of the target while neglecting others. In this work, we instead begin from an explicit target distribution, obtained by filtering out incorrect answers while preserving the relative probabilities of correct ones. Starting from a pre-trained LLM, we approximate this target distribution using the $\alpha$-divergence family, which unifies prior approaches and enables direct control of the precision-diversity trade-off by interpolating between mode-seeking and mass-covering divergences. On a Lean theorem-proving benchmark, our method achieves state-of-the-art performance along the coverage-precision Pareto frontier, outperforming all prior methods on the coverage axis.
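To make the abstract's two ingredients concrete, here is a minimal NumPy sketch of (a) the filtered target, which zeroes out incorrect answers and renormalizes so that the relative probabilities among correct ones are preserved, and (b) the $\alpha$-divergence family interpolating between mode-seeking and mass-covering behavior. The toy answer set, the probabilities, and the function names are illustrative assumptions, not code from the paper.

```python
import numpy as np

# Toy discrete setting: 6 candidate answers, of which 3 are correct.
base = np.array([0.40, 0.25, 0.15, 0.10, 0.07, 0.03])  # pre-trained LLM probs
correct = np.array([1, 0, 1, 0, 1, 0], dtype=float)     # correctness filter

# Filtered target: zero out incorrect answers, renormalize. The relative
# probabilities among the correct answers are preserved.
target = base * correct
target /= target.sum()

def alpha_divergence(p, q, alpha, eps=1e-12):
    """Amari alpha-divergence D_alpha(p || q).

    alpha -> 1 recovers forward KL(p || q) (mass-covering);
    alpha -> 0 recovers reverse KL(q || p) (mode-seeking).
    """
    p, q = p + eps, q + eps
    if np.isclose(alpha, 1.0):
        return np.sum(p * np.log(p / q))
    if np.isclose(alpha, 0.0):
        return np.sum(q * np.log(q / p))
    return (1.0 - np.sum(p**alpha * q**(1 - alpha))) / (alpha * (1 - alpha))

# A mode-collapsed model vs. the filtered target: sweeping alpha between
# 0 and 1 is the precision-diversity knob the abstract describes.
q = np.array([0.9, 0.0, 0.05, 0.0, 0.05, 0.0]) + 1e-6
q /= q.sum()
for a in (0.0, 0.5, 1.0):
    print(f"alpha={a}: D={alpha_divergence(target, q, a):.4f}")
```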
Related papers
- SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning [50.93295951454092]
We introduce a set-level diversity objective defined over sampled trajectories using kernelized similarity.
Our approach derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective as a plug-in advantage shaping term for policy optimization.
Experiments across a range of model scales demonstrate the effectiveness of our proposed algorithm, consistently outperforming strong baselines in both Pass@1 and Pass@K across various benchmarks.
arXiv Detail & Related papers (2026-02-01T07:13:20Z)
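A rough sketch of the leave-one-out, kernelized diversity shaping that SetPO's summary describes; the RBF kernel, the trajectory embeddings, and the shaping weight `beta` are assumptions for illustration, not SetPO's actual design.

```python
import numpy as np

def loo_diversity_bonus(embs, lengthscale=1.0):
    """Leave-one-out diversity contribution per trajectory.

    embs: (n, d) array of trajectory embeddings. Uses an RBF similarity
    kernel (an illustrative choice): a trajectory that closely resembles
    the other n-1 samples contributes little diversity, so it receives
    a small bonus.
    """
    d2 = ((embs[:, None, :] - embs[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * lengthscale**2))      # pairwise similarity matrix
    n = len(embs)
    mean_sim = (K.sum(axis=1) - 1.0) / (n - 1)  # exclude self-similarity
    return 1.0 - mean_sim

# Plug-in advantage shaping: task advantage plus a weighted diversity term.
rng = np.random.default_rng(0)
embs = rng.normal(size=(8, 16))   # 8 sampled trajectories, 16-dim embeddings
task_adv = rng.normal(size=8)     # stand-in for the task advantages
beta = 0.1                        # shaping weight (assumed)
shaped_adv = task_adv + beta * loo_diversity_bonus(embs)
print(shaped_adv)
```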
- Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective [60.45433515408158]
We show that long Chain-of-Thought (CoT) serves as a decisive decision-maker for the top option but fails to function as a granular distribution calibrator for ambiguous tasks.
We observe a distinct "decoupled mechanism": while CoT improves distributional alignment, final accuracy is dictated by CoT content.
arXiv Detail & Related papers (2026-01-06T16:26:40Z)
- FlowRL: Matching Reward Distributions for LLM Reasoning [69.88820066093798]
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL).
We transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution.
arXiv Detail & Related papers (2025-09-18T17:56:36Z)
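Reduced to its simplest discrete form, FlowRL's recipe looks roughly like the sketch below. The toy rewards are invented, and the partition function is computed in closed form here, whereas the summary says FlowRL learns it.

```python
import numpy as np

# Turn scalar rewards over a discrete answer set into a normalized
# target distribution p(y) ∝ exp(r(y)).
rewards = np.array([2.0, 1.5, 0.1, -1.0])
target = np.exp(rewards)
target /= target.sum()   # normalized in closed form here; FlowRL learns log Z

# The reverse KL between the current policy and that target is the
# quantity such training would drive down.
policy = np.array([0.7, 0.1, 0.1, 0.1])
reverse_kl = np.sum(policy * np.log(policy / target))
print(f"KL(pi || p) = {reverse_kl:.4f}")
```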
- Aligning Latent Spaces with Flow Priors [72.24305287508474]
This paper presents a novel framework for aligning learnable latent spaces to arbitrary target distributions by leveraging flow-based generative models as priors.
Notably, the proposed method eliminates computationally expensive likelihood evaluations and avoids ODE solving during optimization.
arXiv Detail & Related papers (2025-06-05T16:59:53Z)
- Importance Weighted Score Matching for Diffusion Samplers with Enhanced Mode Coverage [16.94974733994214]
Prevailing methods often circumvent the lack of target data by optimizing reverse KL-based objectives.
We propose a principled approach for training diffusion-based samplers by directly targeting an objective analogous to the forward KL divergence.
Our approach consistently outperforms existing neural samplers across all distributional distance metrics.
arXiv Detail & Related papers (2025-05-26T02:48:26Z)
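Setting the diffusion machinery aside, the underlying difficulty in the Importance Weighted Score Matching summary is estimating a forward-KL-style objective when only the sampler, not the target, can be sampled. A minimal sketch using self-normalized importance weighting follows; the bimodal target and the Gaussian proposal are illustrative assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p_tilde(x):
    """Unnormalized bimodal target: both modes must be covered."""
    return np.logaddexp(-0.5 * (x - 3.0) ** 2, -0.5 * (x + 3.0) ** 2)

# We can only sample the model q = N(0, 4^2), never the target p.
mu, sig = 0.0, 4.0
x = rng.normal(mu, sig, size=10_000)
log_q = -0.5 * ((x - mu) / sig) ** 2 - np.log(sig * np.sqrt(2.0 * np.pi))

# Self-normalized importance weights turn q-samples into p-expectations.
log_w = log_p_tilde(x) - log_q
w = np.exp(log_w - np.logaddexp.reduce(log_w))
log_Z = np.logaddexp.reduce(log_w) - np.log(len(x))  # estimate of log Z

# Forward KL(p || q) = E_p[log p_tilde - log q] - log Z
fwd_kl = np.sum(w * (log_p_tilde(x) - log_q)) - log_Z
print(f"estimated forward KL(p || q): {fwd_kl:.4f}")
```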
- Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models [31.589548159881932]
We introduce the Diffusion Chain of Lateral Thought (DCoLT), a reasoning framework for diffusion language models.
DCoLT allows bidirectional, non-linear reasoning with no strict rule on grammatical correctness amid its intermediate steps of thought.
We show that DCoLT-reinforced Diffusion Language Models (DLMs) outperform other DLMs trained by SFT or RL.
arXiv Detail & Related papers (2025-05-15T16:06:32Z)
- Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial.
Our research investigates the fragility of uncertainty estimation and explores potential attacks.
We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z)
- Aligning Language Models with Preferences through f-divergence Minimization [4.952674870169772]
f-DPG allows the use of any f-divergence to approximate any target distribution that can be evaluated.
We show that Jensen-Shannon divergence strikes a good balance between these objectives, and frequently outperforms forward KL divergence by a wide margin.
arXiv Detail & Related papers (2023-02-16T10:59:39Z)
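For intuition on why f-DPG's choice of divergence matters, the small sketch below compares forward KL with Jensen-Shannon on toy distributions; the distributions are invented for illustration, and this is a property check, not f-DPG's training objective.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Forward KL divergence KL(p || q)."""
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q))

def js(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# JS penalizes mismatches more gently than forward KL, which can grow
# without bound when the model q puts little mass where the target p does.
p = np.array([0.5, 0.5, 0.0, 0.0])      # peaked target
q = np.array([0.25, 0.25, 0.25, 0.25])  # broad model
print(f"KL(p||q) = {kl(p, q):.4f}, JS(p, q) = {js(p, q):.4f}")
```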
- KL Guided Domain Adaptation [88.19298405363452]
Domain adaptation is an important problem and is often needed for real-world applications.
A common approach in the domain adaptation literature is to learn a representation of the input that has the same distributions over the source and the target domain.
We show that with a probabilistic representation network, the KL term can be estimated efficiently via minibatch samples.
arXiv Detail & Related papers (2021-06-14T22:24:23Z)
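A minimal Monte Carlo sketch of the KL estimation step described in the KL Guided Domain Adaptation summary, assuming isotropic Gaussian encoders and approximating each domain's marginal over representations by the minibatch mixture. This illustrates the idea rather than reproducing the paper's exact estimator.

```python
import numpy as np

def log_gauss_mixture(z, mus, sigma):
    """Log-density of a uniform mixture of isotropic Gaussians.

    Approximates a domain's marginal p(z) from one minibatch of
    per-example encoder means `mus`.
    """
    d = z.shape[-1]
    d2 = ((z[:, None, :] - mus[None, :, :]) ** 2).sum(-1)
    log_comp = -0.5 * d2 / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)
    return np.logaddexp.reduce(log_comp, axis=1) - np.log(len(mus))

rng = np.random.default_rng(0)
sigma = 0.5
mu_s = rng.normal(0.0, 1.0, size=(64, 8))  # source-domain encoder means
mu_t = rng.normal(0.3, 1.0, size=(64, 8))  # target-domain encoder means

# Monte Carlo estimate of KL(p_s(z) || p_t(z)): sample z from the source
# marginal, then average the log-density ratio over the minibatch.
z = mu_s + sigma * rng.normal(size=mu_s.shape)
kl_est = np.mean(log_gauss_mixture(z, mu_s, sigma)
                 - log_gauss_mixture(z, mu_t, sigma))
print(f"minibatch KL estimate: {kl_est:.4f}")
```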
- A Distributional Approach to Controlled Text Generation [3.279201607581627]
We propose a Distributional Approach to address Controlled Text Generation from pre-trained Language Models (LMs).
This view permits defining, in a single formal framework, both "pointwise" and "distributional" constraints over the target LM.
We then perform experiments over distributional constraints, a unique feature of our approach, demonstrating its potential as a remedy to the problem of Bias in Language Models.
arXiv Detail & Related papers (2020-12-21T19:02:41Z)
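As a toy illustration of a distributional constraint, the sketch below takes the energy-based view in which the target reweights the base LM by exponential features, with the multiplier tuned until a desired moment is met; the four-outcome setup, the feature, and the simple update rule are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

# Target P(x) ∝ a(x) * exp(lam * phi(x)), with lam tuned so that the
# expected feature E_P[phi] hits a desired moment, e.g. "50% of outputs
# mention topic T" as a distributional constraint.
a = np.array([0.5, 0.3, 0.15, 0.05])  # base LM probs over 4 toy outputs
phi = np.array([1.0, 0.0, 1.0, 0.0])  # feature: output mentions topic T
target_moment = 0.5                   # desired expectation of phi

lam = 0.0
for _ in range(200):                  # simple fixed-point search for lam
    p = a * np.exp(lam * phi)
    p /= p.sum()
    lam += 2.0 * (target_moment - p @ phi)

print(f"lambda = {lam:.3f}, E_P[phi] = {p @ phi:.3f}")
```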