Failure Modes of Maximum Entropy RLHF
- URL: http://arxiv.org/abs/2509.20265v1
- Date: Wed, 24 Sep 2025 15:52:36 GMT
- Title: Failure Modes of Maximum Entropy RLHF
- Authors: Ömer Veysel Çağatan, Barış Akgün
- Abstract summary: We show that SimPO can be derived as Maximum Entropy Reinforcement Learning with length-normalized temperature. We investigate whether Maximum Entropy RL can achieve similar results in online RLHF settings.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we show that Simple Preference Optimization (SimPO) can be derived as Maximum Entropy Reinforcement Learning with length-normalized temperature, providing a theoretical foundation for this reference-free method. Motivated by SimPO's strong performance in offline preference optimization, we investigate whether Maximum Entropy RL can achieve similar results in online RLHF settings. Our experiments find that Maximum Entropy RL consistently exhibits overoptimization and unstable KL dynamics, even at very low learning rates. Unlike KL-constrained methods that maintain stable training, entropy regularization fails to prevent reward hacking and appears to correlate with overoptimization. Lastly, we discuss possible explanations for why SimPO succeeds in offline settings while Maximum Entropy RL struggles in online scenarios. Our findings suggest that reference-free approaches may face distinct challenges when applied to online or offline preference learning.
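As context for the abstract's claims, here is a brief sketch of the relevant objectives in their commonly published forms (the notation $\beta$, $\tau$, $\gamma$, and $\pi_{\mathrm{ref}}$ follows standard usage and is not taken from this paper):

```latex
% KL-constrained RLHF: reward maximization penalized by divergence from a reference policy
\max_{\pi_\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right]

% Reference-free maximum entropy RL: the KL anchor is replaced by an entropy bonus with temperature \tau
\max_{\pi_\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right]
  \;+\; \tau\, \mathcal{H}\!\left[ \pi_\theta(\cdot \mid x) \right]

% SimPO: length-normalized log-likelihood margin between chosen (y_w) and rejected (y_l) responses
\mathcal{L}_{\mathrm{SimPO}}(\theta) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[
    \log \sigma\!\left(
      \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x)
      \;-\; \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x)
      \;-\; \gamma
    \right)
  \right]
```

Under this view, the per-response scaling $\beta/|y|$ in SimPO acts as a length-normalized temperature on the log-likelihood terms, which is the connection the abstract refers to; in the online setting, the entropy bonus, unlike the KL term, provides no anchor to a reference policy $\pi_{\mathrm{ref}}$.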
Related papers
- Stabilizing Policy Optimization via Logits Convexity [59.242732612484474]
We show that the convexity of the supervised fine-tuning loss with respect to model logits plays a key role in enabling stable training. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework.
arXiv Detail & Related papers (2026-03-01T07:40:12Z)
- FLAC: Maximum Entropy RL via Kinetic Energy Regularized Bridge Matching [28.98935867615678]
We propose a framework that regulates policy entropy by penalizing the kinetic energy of the velocity field. We derive an energy-regularized policy scheme and a practical off-policy algorithm that automatically tunes the kinetic energy.
arXiv Detail & Related papers (2026-02-13T11:32:10Z)
- SAFE: Stable Alignment Finetuning with Entropy-Aware Predictive Control for Reinforcement Learning from Human Feedback (RLHF) [0.0]
We develop a new purely on-policy actor-critic RL method for the LM-RLHF setting. We present SAFE (Stable Alignment Finetuning with Entropy-aware control), a novel RLHF algorithm that combines a Double Soft-Min Critic for pessimistic value estimation with a new multi-layer stabilization framework.
arXiv Detail & Related papers (2026-02-04T15:26:44Z)
- Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching [8.665369041430969]
Flow Matching (FM) enables one-step generation, but integrating it into Maximum Entropy Reinforcement Learning (MaxEnt RL) is challenging. We propose Flow-based Log-likelihood-Aware Maximum Entropy RL (FLAME), a principled framework that addresses these challenges.
arXiv Detail & Related papers (2026-02-02T03:54:11Z)
- Q-learning with Adjoint Matching [58.78551025170267]
We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm. QAM sidesteps two challenges by leveraging adjoint matching, a recently proposed technique in generative modeling. It consistently outperforms prior approaches on hard, sparse-reward tasks in both offline and offline-to-online RL.
arXiv Detail & Related papers (2026-01-20T18:45:34Z)
- DIME: Diffusion-Based Maximum Entropy Reinforcement Learning [38.17326719163195]
We propose Diffusion-Based Maximum Entropy RL (DIME). DIME leverages recent advances in approximate inference with diffusion models to derive a lower bound on the maximum entropy objective. Our method enables the use of expressive diffusion-based policies while retaining the principled exploration benefits of MaxEnt RL.
arXiv Detail & Related papers (2025-02-04T13:37:14Z)
- Maximum Entropy On-Policy Actor-Critic via Entropy Advantage Estimation [0.276240219662896]
A notable form of entropy regularisation is augmenting the objective with an entropy term, thereby simultaneously optimising the expected return and the entropy.
This framework, known as maximum entropy reinforcement learning (MaxEnt RL), has shown theoretical and empirical successes.
This paper proposes a simple method of separating the entropy objective from the MaxEnt RL objective, which facilitates the implementation of MaxEnt RL in on-policy settings.
arXiv Detail & Related papers (2024-07-25T15:48:24Z)
- Preference Elicitation for Offline Reinforcement Learning [59.136381500967744]
We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm. Our algorithm employs a pessimistic approach for out-of-distribution data and an optimistic approach for acquiring informative preferences about the optimal policy.
arXiv Detail & Related papers (2024-06-26T15:59:13Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF).
We first identify the primary challenge of existing popular methods, such as offline PPO and offline DPO, as a lack of strategic exploration of the environment.
We propose efficient algorithms with finite-sample theoretical guarantees.
arXiv Detail & Related papers (2023-12-18T18:58:42Z)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)