KL-Regularized Reinforcement Learning is Designed to Mode Collapse
- URL: http://arxiv.org/abs/2510.20817v1
- Date: Thu, 23 Oct 2025 17:59:40 GMT
- Title: KL-Regularized Reinforcement Learning is Designed to Mode Collapse
- Authors: Anthony GX-Chen, Jatin Prakash, Jeff Guo, Rob Fergus, Rajesh Ranganath
- Abstract summary: We show that the choice of reverse/forward KL determines the family of optimal target distributions. We leverage these insights to construct a simple, scalable, and theoretically justified algorithm.
- Score: 29.23421728376746
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: It is commonly believed that optimizing the reverse KL divergence results in "mode seeking", while optimizing forward KL results in "mass covering", with the latter being preferred if the goal is to sample from multiple diverse modes. We show -- mathematically and empirically -- that this intuition does not necessarily transfer well to doing reinforcement learning with reverse/forward KL regularization (e.g. as commonly used with language models). Instead, the choice of reverse/forward KL determines the family of optimal target distributions, parameterized by the regularization coefficient. Mode coverage depends primarily on other factors, such as regularization strength, and relative scales between rewards and reference probabilities. Further, we show commonly used settings such as low regularization strength and equal verifiable rewards tend to specify unimodal target distributions, meaning the optimization objective is, by construction, non-diverse. We leverage these insights to construct a simple, scalable, and theoretically justified algorithm. It makes minimal changes to reward magnitudes, yet optimizes for a target distribution which puts high probability over all high-quality sampling modes. In experiments, this simple modification works to post-train both Large Language Models and Chemical Language Models to have higher solution quality and diversity, without any external signals of diversity, and works with both forward and reverse KL when using either naively fails.
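For orientation, the reverse-KL-regularized objective max_pi E_pi[r] - beta * KL(pi || pi_ref) has the well-known closed-form optimum pi*(y) proportional to pi_ref(y) * exp(r(y)/beta). The NumPy sketch below uses a toy four-answer example (the probabilities, rewards, and beta values are my own illustrative choices, not numbers from the paper) to show how equal verifiable rewards with small beta yield a target that simply inherits the reference model's skew over the correct answers:

```python
import numpy as np

# Toy setup (illustrative numbers, not from the paper): four candidate answers,
# the first three are "correct", the last is wrong.
pi_ref = np.array([0.70, 0.15, 0.05, 0.10])   # reference-model probabilities
reward = np.array([1.0, 1.0, 1.0, 0.0])       # equal verifiable rewards

def reverse_kl_target(pi_ref, reward, beta):
    """Closed-form optimum of  max_pi E_pi[r] - beta * KL(pi || pi_ref):
    pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta)."""
    logits = np.log(pi_ref) + reward / beta
    logits -= logits.max()                    # numerical stability
    w = np.exp(logits)
    return w / w.sum()

for beta in (10.0, 1.0, 0.1, 0.01):
    print(beta, np.round(reverse_kl_target(pi_ref, reward, beta), 3))

# As beta shrinks, mass moves onto the correct answers but stays proportional to
# pi_ref among them, so the target remains dominated by the single mode the
# reference model already prefers rather than spreading over all correct answers.
```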
Related papers
- SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning [50.93295951454092]
We introduce a set-level diversity objective defined over sampled trajectories using kernelized similarity. Our approach derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective as a plug-in advantage shaping term for policy optimization. Experiments across a range of model scales demonstrate the effectiveness of our proposed algorithm, consistently outperforming strong baselines in both Pass@1 and Pass@K across various benchmarks.
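The snippet below is only a guess at what a kernelized set-level diversity term with leave-one-out credit assignment might look like (the RBF kernel, the average-dissimilarity definition, and the shaping weight lam are assumptions for illustration, not details taken from the paper):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Similarity between two trajectory embeddings (kernel choice is assumed).
    return np.exp(-gamma * np.sum((a - b) ** 2))

def set_diversity(embs):
    # One possible set-level diversity score: average pairwise dissimilarity.
    n = len(embs)
    sims = [rbf_kernel(embs[i], embs[j]) for i in range(n) for j in range(n) if i != j]
    return 1.0 - np.mean(sims)

def leave_one_out_contributions(embs):
    # Marginal contribution of each trajectory to the set-level diversity.
    full = set_diversity(embs)
    return np.array([full - set_diversity(np.delete(embs, i, axis=0)) for i in range(len(embs))])

# Usage: shape per-trajectory advantages with the diversity contribution.
rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 8))      # embeddings of 4 sampled trajectories
advantages = rng.normal(size=4)     # task advantages from the base RL algorithm
lam = 0.1                           # shaping weight (assumed)
shaped = advantages + lam * leave_one_out_contributions(embs)
print(np.round(shaped, 3))
```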
arXiv Detail & Related papers (2026-02-01T07:13:20Z) - Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity [13.211627219720796]
Reinforcement Learning (RL) has become the de facto standard for tuning LLMs to solve tasks involving reasoning. We argue that RL implicitly optimizes the "mode-seeking" or "zero-forcing" reverse KL to a target distribution, causing the model to concentrate mass on certain high-probability regions of the target while neglecting others. In this work, we instead begin from an explicit target distribution, obtained by filtering out incorrect answers while neglecting the relative probabilities of correct ones.
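As a toy illustration of such an explicit target (my own sketch, not the paper's construction): filtering keeps only the answers the verifier accepts, and weighting them uniformly, rather than by their reference-model probabilities, matches the "neglecting the relative probabilities of correct ones" reading; both variants are shown for contrast:

```python
import numpy as np

# Illustrative numbers only: probabilities of four sampled answers under the
# reference model, plus 0/1 verifier labels.
pi_ref  = np.array([0.70, 0.15, 0.05, 0.10])
correct = np.array([1, 1, 1, 0], dtype=bool)

# Variant A: filter, but keep reference-relative weights among correct answers.
target_a = np.where(correct, pi_ref, 0.0)
target_a /= target_a.sum()

# Variant B: filter and ignore reference weights (uniform over correct answers).
target_b = correct / correct.sum()

print(np.round(target_a, 3), np.round(target_b, 3))
```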
arXiv Detail & Related papers (2025-12-05T18:56:40Z) - FlowRL: Matching Reward Distributions for LLM Reasoning [69.88820066093798]
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). We transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution.
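One plausible reading of this in code (the squared log-ratio residual, the beta scaling, and the fixed log_Z value are assumptions, not the paper's exact loss) treats the target as proportional to exp(beta * r) with a learnable log-partition term and penalizes the policy's per-sample deviation from it:

```python
import numpy as np

def flow_balance_loss(logprob_y, reward, log_Z, beta=1.0):
    # Assumed surrogate: target(y) proportional to exp(beta * r(y)) with learnable
    # log-partition log_Z; push the policy toward it with a squared log-ratio
    # ("flow balance") residual per sampled completion y.
    residual = log_Z + logprob_y - beta * reward
    return np.mean(residual ** 2)

# Toy usage: logprob_y = policy log-probabilities of sampled completions,
# reward = scalar verifier / reward-model scores; log_Z would be learned jointly
# (here just a fixed number for illustration).
logprob_y = np.array([-3.2, -5.1, -2.8])
reward    = np.array([ 1.0,  0.0,  1.0])
print(flow_balance_loss(logprob_y, reward, log_Z=0.5))
```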
arXiv Detail & Related papers (2025-09-18T17:56:36Z) - Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications. Ensuring their alignment with the diverse preferences of individual users has become a critical challenge. We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z) - Decoding-Time Language Model Alignment with Multiple Objectives [116.42095026960598]
Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives.
Here, we propose multi-objective decoding (MOD), a decoding-time algorithm that outputs the next token from a linear combination of predictions.
We show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method.
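A minimal sketch of decoding-time combination in this spirit (the log-space weighted sum and renormalization are my assumptions; the paper derives its own combination rule with optimality guarantees):

```python
import numpy as np

def combine_next_token(logprobs_per_model, weights):
    # logprobs_per_model: (num_models, vocab) next-token log-probabilities.
    # Combine in log space with per-objective weights, then renormalize.
    mixed = np.tensordot(weights, logprobs_per_model, axes=1)   # weighted sum of log-probs
    mixed -= np.logaddexp.reduce(mixed)                         # renormalize to a distribution
    return np.exp(mixed)

# Toy usage: two base models (e.g. tuned for different rewards), vocab of 5 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 5))
logprobs = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
weights = np.array([0.7, 0.3])                                  # preference over objectives
print(np.round(combine_next_token(logprobs, weights), 3))
```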
arXiv Detail & Related papers (2024-06-27T02:46:30Z) - Maximum Likelihood Estimation is All You Need for Well-Specified Covariate Shift [34.414261291690856]
A key challenge of modern machine learning systems is to achieve Out-of-Distribution (OOD) generalization.
We show that classical Maximum Likelihood Estimation (MLE) purely using source data achieves the minimax optimality.
We illustrate the wide applicability of our framework by instantiating it to three concrete examples.
arXiv Detail & Related papers (2023-11-27T16:06:48Z) - Aligning Language Models with Preferences through f-divergence Minimization [4.952674870169772]
f-DPG allows the use of any f-divergence to approximate any target distribution that can be evaluated.
We show that Jensen-Shannon divergence strikes a good balance between these objectives, and frequently outperforms forward KL divergence by a wide margin.
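For reference, the divergences being contrasted are easy to compute on a toy discrete example (illustrative numbers only, not from the paper):

```python
import numpy as np

def kl(p, q):
    # Forward KL(p || q) for discrete distributions with full support.
    return float(np.sum(p * (np.log(p) - np.log(q))))

def js(p, q):
    # Jensen-Shannon divergence: symmetrized KL to the mixture.
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

target = np.array([0.5, 0.3, 0.15, 0.05])
policy = np.array([0.8, 0.1, 0.05, 0.05])
print("forward KL(target||policy):", round(kl(target, policy), 4))
print("reverse KL(policy||target):", round(kl(policy, target), 4))
print("Jensen-Shannon:            ", round(js(target, policy), 4))
```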
arXiv Detail & Related papers (2023-02-16T10:59:39Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - Variational Refinement for Importance Sampling Using the Forward Kullback-Leibler Divergence [77.06203118175335]
Variational Inference (VI) is a popular alternative to exact sampling in Bayesian inference.
Importance sampling (IS) is often used to fine-tune and de-bias the estimates of approximate Bayesian inference procedures.
We propose a novel combination of optimization and sampling techniques for approximate Bayesian inference.
arXiv Detail & Related papers (2021-06-30T11:00:24Z) - KL Guided Domain Adaptation [88.19298405363452]
Domain adaptation is an important problem and often needed for real-world applications.
A common approach in the domain adaptation literature is to learn a representation of the input that has the same distributions over the source and the target domain.
We show that with a probabilistic representation network, the KL term can be estimated efficiently via minibatch samples.
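A rough sketch of what a minibatch Monte Carlo estimate of such a KL term could look like when the encoder outputs a Gaussian per example (treating each minibatch as a mixture of those Gaussians is my assumption, not necessarily the paper's estimator):

```python
import numpy as np

def gaussian_logpdf(z, mu, sigma):
    # Log-density of z under each diagonal Gaussian in a batch (mu, sigma).
    return -0.5 * np.sum(((z - mu) / sigma) ** 2 + 2 * np.log(sigma) + np.log(2 * np.pi), axis=-1)

def minibatch_kl_estimate(mu_s, sig_s, mu_t, sig_t, rng):
    # Approximate each domain's representation distribution by the mixture of the
    # per-example Gaussians in the minibatch, sample z from the source side, and
    # Monte Carlo estimate KL(p_source(z) || p_target(z)).
    z = mu_s + sig_s * rng.normal(size=mu_s.shape)     # one sample per source example
    log_ps = np.array([np.log(np.mean(np.exp(gaussian_logpdf(zi, mu_s, sig_s)))) for zi in z])
    log_pt = np.array([np.log(np.mean(np.exp(gaussian_logpdf(zi, mu_t, sig_t)))) for zi in z])
    return float(np.mean(log_ps - log_pt))

rng = np.random.default_rng(0)
mu_s, sig_s = rng.normal(size=(32, 8)), np.full((32, 8), 0.5)        # source minibatch encodings
mu_t, sig_t = rng.normal(size=(32, 8)) + 0.3, np.full((32, 8), 0.5)  # shifted target minibatch
print(round(minibatch_kl_estimate(mu_s, sig_s, mu_t, sig_t, rng), 3))
```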
arXiv Detail & Related papers (2021-06-14T22:24:23Z) - Markovian Score Climbing: Variational Inference with KL(p||q) [16.661889249333676]
We develop a simple algorithm for reliably minimizing the inclusive Kullback-Leibler (KL) divergence, KL(p||q).
This method converges to a local optimum of the inclusive KL.
It does not suffer from the systematic errors inherent in existing methods, such as Reweighted Wake-Sleep and Neural Adaptive Monte Carlo.
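A toy instance of the score-climbing idea (my own sketch, not the paper's code): fit a Gaussian q to an unnormalized target p by stochastic gradient steps on KL(p||q), with the expectation over p approximated by a persistent Metropolis-Hastings chain:

```python
import numpy as np

def log_p(z):
    # Unnormalized target: equal-weight mixture of two Gaussian bumps.
    return np.logaddexp(-0.5 * (z - 2.0) ** 2, -0.5 * (z + 1.0) ** 2)

rng = np.random.default_rng(0)
m, log_s = 0.0, 0.0        # variational parameters of q = Normal(m, exp(log_s))
z = 0.0                    # persistent chain state (the "Markovian" part)
lr = 0.02

for step in range(2000):
    # One MH move targeting p, kept between updates rather than resampled i.i.d.
    prop = z + rng.normal(scale=1.0)
    if np.log(rng.uniform()) < log_p(prop) - log_p(z):
        z = prop
    # Stochastic gradient of KL(p||q) = -E_p[grad log q(z)], using the chain sample.
    s = np.exp(log_s)
    grad_m = (z - m) / s ** 2                      # d log q / d m
    grad_log_s = ((z - m) ** 2 / s ** 2) - 1.0     # d log q / d log_s
    m += lr * grad_m                               # ascend E_p[log q] == descend KL(p||q)
    log_s += lr * grad_log_s

print(round(m, 2), round(np.exp(log_s), 2))        # roughly the target's mean and spread
```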
arXiv Detail & Related papers (2020-03-23T16:38:10Z)