Distribution-Centric Policy Optimization Dominates Exploration-Exploitation Trade-off
- URL: http://arxiv.org/abs/2601.12730v1
- Date: Mon, 19 Jan 2026 05:20:46 GMT
- Title: Distribution-Centric Policy Optimization Dominates Exploration-Exploitation Trade-off
- Authors: Zhaochun Li, Chen Wang, Jionghao Bai, Shisheng Cui, Ge Lan, Zhou Zhao, Yue Wang
- Abstract summary: We introduce a distribution-centric perspective for reinforcement learning. We propose Distribution-Centric Policy Optimization (DCPO), which reformulates entropy regulation as distribution-level regularization. Overall, DCPO replaces sample-level heuristics with distribution-level principles, offering a theoretically grounded and flexible framework for controllable exploration and a stronger EE trade-off.
- Score: 34.80019950191864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The exploration-exploitation (EE) trade-off is a central challenge in reinforcement learning (RL) for large language models (LLMs). With Group Relative Policy Optimization (GRPO), training tends to be exploitation driven: entropy decreases monotonically, samples converge, and exploration fades. Most existing fixes are \textbf{sample-centric}: they seek out or reward rare samples, assuming exploration comes from novel trajectories and tokens. These heuristics depend on the "luck" of informative samples, lack principled control of the policy, and often yield limited or inconsistent gains. In this work, we are the first to introduce a \textbf{distribution-centric} perspective for RL, in which exploration is always guided by a "better" target distribution, and reveal that a policy's ability to resist entropy collapse is governed by the distribution itself rather than individual samples. Building on this insight, we propose Distribution-Centric Policy Optimization (DCPO), which reformulates entropy regulation as distribution-level regularization. DCPO achieves controllable entropy fully on-policy without sampling from external distributions, enabling efficient exploration while maintaining training stability. Across multiple models and seven benchmarks, DCPO improves over GRPO by about 20\% on average. Overall, DCPO replaces sample-level heuristics with distribution-level principles, offering a theoretically grounded and flexible framework for controllable exploration and a stronger EE trade-off. The code is available at https://github.com/597358816/DCPO.
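The abstract does not spell out DCPO's objective, so the following is only a minimal sketch of the distribution-level idea it describes: instead of bonusing rare sampled tokens, a GRPO-style policy-gradient term is paired with a regularizer on the policy's entropy, a property of the full output distribution, steered toward a controllable target. The function name, the squared-error penalty, and the coefficient `beta` are assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def dcpo_style_loss(logits, actions, group_advantages, target_entropy, beta=0.1):
    """Illustrative GRPO-style loss with a distribution-level entropy regularizer.

    NOT the authors' exact DCPO objective (the abstract does not give it); it only
    sketches steering policy entropy, computed over the full token distribution,
    toward a controllable target instead of rewarding rare sampled tokens.

    logits:           [batch, seq, vocab] current-policy logits
    actions:          [batch, seq] sampled token ids
    group_advantages: [batch] group-normalized (GRPO-style) advantages
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    # Standard policy-gradient term on the sampled tokens.
    token_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # [B, T]
    pg_loss = -(group_advantages.unsqueeze(-1) * token_logp).mean()

    # Distribution-level term: mean entropy of the full token distribution,
    # pulled toward a target value (controllable exploration).
    entropy = -(probs * log_probs).sum(-1).mean()
    entropy_reg = (entropy - target_entropy) ** 2

    return pg_loss + beta * entropy_reg
```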
Related papers
- Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning [88.42566960813438]
CalibRL is a hybrid-policy RLVR framework that supports controllable exploration with expert guidance. CalibRL increases policy entropy in a guided manner and clarifies the target distribution. Experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements.
arXiv Detail & Related papers (2026-02-22T07:23:36Z) - Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for reinforcement learning with Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint. DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
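The summary does not say which divergence DPPO adopts, so the sketch below only contrasts PPO's clipped surrogate with one plausible divergence-penalized substitute (a KL penalty toward the previous policy); the penalty form and coefficient `beta` are assumptions.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Standard PPO clipped surrogate, shown for contrast."""
    ratio = (logp_new - logp_old).exp()
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return -torch.min(ratio * adv, clipped * adv).mean()

def divergence_penalized_loss(logp_new, logp_old, adv, beta=0.05):
    """One plausible divergence-based substitute for clipping (illustrative only;
    the summary does not specify which divergence DPPO actually uses)."""
    ratio = (logp_new - logp_old).exp()
    # Per-sample estimator of KL(pi_old || pi_new): r - 1 - log r, with r = new/old.
    kl = ratio - 1.0 - (logp_new - logp_old)
    return -(ratio * adv).mean() + beta * kl.mean()
```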
arXiv Detail & Related papers (2026-02-04T18:59:04Z) - SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models [67.41779761651924]
SOUP is a framework that unifies off- and on-policy learning within individual samples at the token level. It consistently outperforms standard on-policy training and existing off-policy extensions.
arXiv Detail & Related papers (2026-01-29T09:56:15Z) - Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning [49.92803982100042]
We propose using the entropy ratio between the current and previous policies as a new global metric. We introduce an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the global distribution level and compensates for the inability of PPO-clip to regulate probability shifts of unsampled actions.
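As a rough illustration of the stated idea, the bidirectional constraint on the entropy ratio could be realized as a hinge penalty that activates only when the ratio of current to previous policy entropy leaves an allowed band; the band, the penalty form, and how the constraint is applied in the paper are not given in the summary and are assumed here.

```python
import torch
import torch.nn.functional as F

def entropy_ratio_penalty(logits_new, logits_old, low=0.9, high=1.1, beta=1.0):
    """Illustrative bidirectional constraint on H(pi_new) / H(pi_old).

    The summary only states that ERC constrains this ratio from both sides;
    the hinge penalty outside [low, high] below is an assumption.
    """
    def mean_entropy(logits):
        logp = F.log_softmax(logits, dim=-1)
        return -(logp.exp() * logp).sum(-1).mean()

    h_new = mean_entropy(logits_new)
    h_old = mean_entropy(logits_old).detach()   # previous policy: no gradient
    ratio = h_new / (h_old + 1e-8)

    # Penalize only when the ratio leaves the allowed band, from either side.
    penalty = F.relu(low - ratio) + F.relu(ratio - high)
    return beta * penalty
```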
arXiv Detail & Related papers (2025-12-05T10:26:32Z) - EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning [15.529826552402769]
Training LLM agents in multi-turn environments with sparse rewards presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms.
arXiv Detail & Related papers (2025-09-26T16:51:44Z) - FlowRL: Matching Reward Distributions for LLM Reasoning [69.88820066093798]
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). We transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution.
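Read literally, the summary suggests an objective of roughly the following form, where the reward-shaped target uses the learnable partition function Z_phi(x); the temperature beta and the exact parameterization are assumptions not stated in the summary.

```latex
% Target distribution induced by the scalar reward r(x, y), with learnable
% partition function Z_phi(x); beta is an assumed temperature.
% FlowRL-style objective: reverse KL from the policy to this target.
\[
  \min_{\theta}\; D_{\mathrm{KL}}\!\left(\pi_{\theta}(\cdot \mid x)\,\middle\|\,
  \frac{\exp\!\big(\beta\, r(x,\cdot)\big)}{Z_{\phi}(x)}\right)
  \;=\; \mathbb{E}_{y\sim\pi_{\theta}}\!\big[\log \pi_{\theta}(y\mid x) - \beta\, r(x,y)\big]
  \;+\; \log Z_{\phi}(x).
\]
```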
arXiv Detail & Related papers (2025-09-18T17:56:36Z) - HAEPO: History-Aggregated Exploratory Policy Optimization [4.782714372521615]
We introduce History-Aggregated Exploratory Policy Optimization (HAEPO), a history-aware exploratory loss designed to combat the shortcomings of existing methods. HAEPO compresses each trajectory into the sum of its log-probabilities and applies a Plackett-Luce softmax across trajectories. Empirically, HAEPO converges quickly, explores thoroughly, aligns closely with true rewards, and demonstrates robust learning behavior, better than or on par with PPO, GRPO, and DPO.
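A minimal reading of the mechanism described above, with the ranking stage of the Plackett-Luce model reduced to a single softmax over trajectory-level log-probability sums; the actual HAEPO loss may differ, and all names below are assumptions.

```python
import torch

def haepo_style_weights(token_logps, mask, temperature=1.0):
    """Compress each trajectory into the sum of its token log-probabilities,
    then softmax across trajectories (only the first stage of a full
    Plackett-Luce model; the paper's exact loss may differ).

    token_logps: [num_traj, max_len] per-token log-probabilities
    mask:        [num_traj, max_len] 1 for real tokens, 0 for padding
    """
    traj_logp = (token_logps * mask).sum(dim=-1)            # [num_traj]
    return torch.softmax(traj_logp / temperature, dim=0)

def haepo_style_loss(token_logps, mask, rewards, temperature=1.0):
    """Illustrative exploratory loss: reward-weighted trajectory probabilities."""
    weights = haepo_style_weights(token_logps, mask, temperature)
    return -(weights * rewards).sum()
```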
arXiv Detail & Related papers (2025-08-26T09:59:44Z) - A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a Reinforce-like algorithm perspective and analyze its core components. We find that a simple rejection sampling baseline, RAFT, yields performance competitive with GRPO and PPO. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters out both entirely incorrect and entirely correct samples.
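The filtering step described above is simple to sketch: drop prompt groups whose samples are all correct or all incorrect, since they carry no contrastive signal under a group baseline. The field names and group structure below are assumptions.

```python
from typing import List, Dict

def filter_groups(groups: List[Dict]) -> List[Dict]:
    """Keep only prompt groups with mixed outcomes, following the summary's
    description of Reinforce-Rej. Each group is assumed to look like
    {"prompt": str, "samples": [...], "correct": [bool, ...]}.
    """
    kept = []
    for g in groups:
        flags = g["correct"]
        if any(flags) and not all(flags):   # mixed group: informative
            kept.append(g)
    return kept
```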
arXiv Detail & Related papers (2025-04-15T16:15:02Z) - Entropy Controllable Direct Preference Optimization [3.536605202672355]
We propose a simple modification to DPO, H-DPO, which allows for control over the entropy of the resulting policy. In our experiments, we show that H-DPO outperformed DPO across various tasks, demonstrating superior results in pass@$k$ evaluations for mathematical tasks.
arXiv Detail & Related papers (2024-11-12T07:09:44Z) - Bag of Policies for Distributional Deep Exploration [7.522221438479138]
Bag of Policies (BoP) is built on top of any return distribution estimator by maintaining a population of its copies.
During training, each episode is controlled by only one of the heads and the collected state-action pairs are used to update all heads off-policy.
BoP results in greater robustness and speed during learning as demonstrated by our experimental results on ALE Atari games.
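The control flow described above can be sketched as follows; `env`, `heads`, and `replay_buffer` are assumed interfaces rather than the authors' implementation.

```python
import random

def run_bop_episode(env, heads, replay_buffer):
    """One randomly chosen head acts for the whole episode, while the transitions
    it collects are stored for off-policy updates of every head."""
    head = random.choice(heads)          # one head controls this episode
    state, done = env.reset(), False
    while not done:
        action = head.act(state)         # act with the chosen head only
        next_state, reward, done = env.step(action)
        replay_buffer.add(state, action, reward, next_state, done)
        state = next_state

def update_all_heads(heads, replay_buffer, batch_size=32):
    batch = replay_buffer.sample(batch_size)
    for head in heads:                   # every head learns from the shared data
        head.update(batch)
```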
arXiv Detail & Related papers (2023-08-03T13:43:03Z) - Robust Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
In continuous action domains, a parameterized action distribution allows easy control of exploration.
In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution.
We evaluated our methods on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym.
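The summary only says RPO uses a perturbed distribution; one simple way to realize this in continuous control is to perturb the mean of a Gaussian policy with uniform noise before sampling, as sketched below. The exact perturbation in the paper and the range `alpha` are assumptions.

```python
import torch
from torch.distributions import Normal

def perturbed_gaussian_action(mean, std, alpha=0.5):
    """Sample from a Gaussian policy whose mean is perturbed by uniform noise,
    one way to keep the action distribution from collapsing during training.
    """
    noise = torch.empty_like(mean).uniform_(-alpha, alpha)
    dist = Normal(mean + noise, std)
    action = dist.sample()
    return action, dist.log_prob(action).sum(-1)
```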
arXiv Detail & Related papers (2022-12-14T22:43:56Z) - Distributional Reinforcement Learning via Moment Matching [54.16108052278444]
We formulate a method that learns a finite set of statistics from each return distribution via neural networks.
Our method can be interpreted as implicitly matching all orders of moments between a return distribution and its Bellman target.
Experiments on the suite of Atari games show that our method outperforms the standard distributional RL baselines.
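One standard way to match all orders of moments implicitly is a kernel MMD between the learned return statistics (particles) and their Bellman targets; the sketch below assumes a Gaussian kernel and scalar particles, which may differ from the paper's exact loss.

```python
import torch

def gaussian_kernel(x, y, bandwidth=1.0):
    """k(x, y) = exp(-(x - y)^2 / (2 * bandwidth^2)) on scalar return particles."""
    return torch.exp(-((x[:, None] - y[None, :]) ** 2) / (2 * bandwidth ** 2))

def mmd_squared(pred_particles, target_particles, bandwidth=1.0):
    """Biased squared-MMD estimate between predicted return particles and
    Bellman-target particles (targets would typically come from a detached
    target network). With a characteristic kernel, driving this to zero
    matches all moments implicitly.
    """
    k_pp = gaussian_kernel(pred_particles, pred_particles, bandwidth).mean()
    k_tt = gaussian_kernel(target_particles, target_particles, bandwidth).mean()
    k_pt = gaussian_kernel(pred_particles, target_particles, bandwidth).mean()
    return k_pp + k_tt - 2 * k_pt
```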
arXiv Detail & Related papers (2020-07-24T05:18:17Z) - Implicit Distributional Reinforcement Learning [61.166030238490634]
We propose an implicit distributional actor-critic (IDAC) built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe that IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.