DRO-REBEL: Distributionally Robust Relative-Reward Regression for Fast and Efficient LLM Alignment
- URL: http://arxiv.org/abs/2509.19104v1
- Date: Tue, 23 Sep 2025 14:49:48 GMT
- Title: DRO-REBEL: Distributionally Robust Relative-Reward Regression for Fast and Efficient LLM Alignment
- Authors: Sharan Sahu, Martin T. Wells
- Abstract summary: Reinforcement learning with human feedback (RLHF) has become crucial for aligning Large Language Models (LLMs) with human intent. We introduce DRO-REBEL, a unified family of robust REBEL updates with type-$p$ Wasserstein, KL, and $\chi^2$ ambiguity sets. Using Fenchel duality, each update reduces to a simple relative-reward regression, preserving scalability and avoiding PPO-style clipping or auxiliary value networks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning with human feedback (RLHF) has become crucial for aligning Large Language Models (LLMs) with human intent. However, existing offline RLHF approaches suffer from overoptimization, where models overfit to reward misspecification and drift from preferred behaviors observed during training. We introduce DRO-REBEL, a unified family of robust REBEL updates with type-$p$ Wasserstein, KL, and $\chi^2$ ambiguity sets. Using Fenchel duality, each update reduces to a simple relative-reward regression, preserving scalability and avoiding PPO-style clipping or auxiliary value networks. Under standard linear-reward and log-linear policy classes with a data-coverage condition, we establish $O(n^{-1/4})$ estimation bounds with tighter constants than prior DRO-DPO approaches, and recover the minimax-optimal $O(n^{-1/2})$ rate via a localized Rademacher complexity analysis. The same analysis closes the gap for Wasserstein-DPO and KL-DPO, showing both also attain optimal parametric rates. We derive practical SGD algorithms for all three divergences: gradient regularization (Wasserstein), importance weighting (KL), and a fast 1-D dual solve ($\chi^2$). Experiments on Emotion Alignment, the large-scale ArmoRM multi-objective benchmark, and HH-Alignment demonstrate strong worst-case robustness across unseen preference mixtures, model sizes, and data scales, with $\chi^2$-REBEL showing consistently strong empirical performance. A controlled radius--coverage study validates a no-free-lunch trade-off: radii shrinking faster than empirical divergence concentration rates achieve minimax-optimal parametric rates but forfeit coverage, while coverage-guaranteeing radii incur $O(n^{-1/4})$ rates.
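To make the reduction concrete: below is a minimal sketch, not the authors' released code, of what the relative-reward regression and two of the robust reweightings described above might look like. Every name and parameter here (rebel_residuals, kl_weights, chi2_worst_case, eta, tau, rho) is an illustrative assumption; the Wasserstein variant, which the abstract describes as gradient regularization, would add a penalty on the gradient norm of the loss and is omitted for brevity.

```python
# Minimal sketch of the robust REBEL-style updates described in the abstract.
# Assumptions (not from the paper): losses are computed per preference pair;
# `eta` is the regression scale, `tau` a KL temperature, `rho` the chi^2 radius.
import numpy as np
from scipy.optimize import minimize_scalar


def rebel_residuals(dlogp_chosen, dlogp_rejected, reward_gap, eta):
    """Relative-reward regression: fit the scaled gap of policy log-ratios
    (new policy vs. reference) to the reward gap of each preference pair."""
    margin = (dlogp_chosen - dlogp_rejected) / eta
    return (margin - reward_gap) ** 2


def kl_weights(losses, tau):
    """KL ambiguity set via importance weighting: exponentially tilt the
    empirical distribution toward high-loss pairs (a softmax over losses)."""
    w = np.exp((losses - losses.max()) / tau)  # shift max for stability
    return w / w.sum()


def chi2_worst_case(losses, rho):
    """chi^2 ambiguity set via the standard 1-D dual
        sup_{Q: chi2(Q||P) <= rho} E_Q[l]
          = inf_eta  sqrt(1 + rho) * sqrt(E_P[(l - eta)_+^2]) + eta,
    solved here by a bounded scalar minimization."""
    assert rho > 0

    def dual(eta):
        excess = np.clip(losses - eta, 0.0, None)
        return np.sqrt(1.0 + rho) * np.sqrt(np.mean(excess ** 2)) + eta

    # The minimizer lies below max(losses); mean - std/sqrt(rho) lower-bounds it.
    lo = min(losses.min(), losses.mean() - losses.std() / np.sqrt(rho)) - 1.0
    res = minimize_scalar(dual, bounds=(lo, losses.max()), method="bounded")
    return res.fun


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    losses = rebel_residuals(rng.normal(size=256), rng.normal(size=256),
                             rng.normal(size=256), eta=1.0)
    print("chi^2 worst case:", chi2_worst_case(losses, rho=0.5))
    print("KL-weighted loss:", float(losses @ kl_weights(losses, tau=1.0)))
```

In a training loop one would, per mini-batch, either reweight the per-pair losses with kl_weights before backpropagation (KL) or differentiate the chi2_worst_case objective at the solved dual point ($\chi^2$); the abstract reports $\chi^2$-REBEL as the consistently strongest variant.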
Related papers
- Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs [19.079556051442168]
Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks. But for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly noisy. We propose a stabilization method for REINFORCE/GRPO-style algorithms that scales the learning rate based on the effective sample size to dampen unreliable updates.
arXiv Detail & Related papers (2026-02-19T18:40:51Z) - Differentially Private Non-convex Distributionally Robust Optimization [5.6429100834174655]
Real-world deployments routinely face group imbalances and adversarial perturbations, under which the traditional Empirical Risk Minimization (ERM) framework can degrade severely. This paper provides a comprehensive study of differentially private (DP) distributionally robust optimization. We show that our proposed methods outperform existing approaches for DP minimax optimization.
arXiv Detail & Related papers (2026-02-18T03:00:30Z) - Robust Variational Bayes by Min-Max Median Aggregation [13.102667562202386]
We propose a robust variational Bayes framework to handle contamination and outliers in datasets. Our approach partitions the data into $m$ disjoint subsets and formulates a joint optimization problem based on robust aggregation principles. Our findings indicate that the two-stage approach yields a smaller approximation error than directly aggregating the $m$-powered local posteriors.
arXiv Detail & Related papers (2025-12-14T13:02:00Z) - Closing the Approximation Gap of Partial AUC Optimization: A Tale of Two Formulations [121.39938773554523]
The Area Under the ROC Curve (AUC) is a pivotal evaluation metric in real-world scenarios with both class imbalance and decision constraints. We present two simple instance-wise minimax reformulations to close the approximation gap of partial AUC (PAUC) optimization. The resulting algorithms enjoy a linear per-iteration computational complexity w.r.t. the sample size and a convergence rate of $O(\epsilon^{-2/3})$ for typical one-way and two-way PAUCs.
arXiv Detail & Related papers (2025-12-01T02:52:33Z) - Better Rates for Private Linear Regression in the Proportional Regime via Aggressive Clipping [19.186034457189162]
A common approach is to set the clipping constant much larger than the expected norm of the per-sample gradients. While this simplifies the analysis, it is in sharp contrast with what empirical evidence suggests is best for performance. Our work bridges this gap between theory and practice by crucially operating in a regime where clipping happens frequently.
arXiv Detail & Related papers (2025-05-22T07:34:27Z) - Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning [5.8191965840377735]
We propose two algorithms that achieve near-optimal sample complexity. We prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}|\, t_{\mathrm{mix}}^{2}\, \varepsilon^{-2}\right)$ for estimating the optimal policy. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning.
arXiv Detail & Related papers (2025-05-15T06:42:25Z) - RePO: Understanding Preference Learning Through ReLU-Based Optimization [66.098833436503]
We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates $\beta$ via two advances. RePO is characterized as SimPO's limiting case ($\beta \to \infty$), where the logistic weighting collapses to binary thresholding; a numerical check of this limit is sketched after this list. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models.
arXiv Detail & Related papers (2025-03-10T15:11:07Z) - Robust Reinforcement Learning from Corrupted Human Feedback [86.17030012828003]
Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data.
We propose a robust RLHF approach, $R^3M$, which models potentially corrupted preference labels as sparse outliers.
Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R^3M$ improves the robustness of the reward against several types of perturbations to the preference data.
arXiv Detail & Related papers (2024-06-21T18:06:30Z) - Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis [16.288866201806382]
We develop a model-free RLHF best-policy identification algorithm, called $\mathsf{BSAD}$, without explicit reward model inference. The algorithm identifies the optimal policy directly from human preference information in a backward manner.
arXiv Detail & Related papers (2024-06-11T17:01:41Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z) - Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning [59.02006924867438]
Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions.
Recent work proposed distributionally robust OPE/L (DROPE/L) to guard against environment shift, but the proposal relies on inverse-propensity weighting.
We propose the first doubly robust (DR) algorithms for DROPE/L with KL-divergence uncertainty sets.
arXiv Detail & Related papers (2022-02-19T20:00:44Z) - Near-Optimal Offline Reinforcement Learning via Double Variance Reduction [36.027428493021716]
Off-Policy Double Variance Reduction (OPDVR) is a new variance-reduction-based algorithm for offline RL.
We show that OPDVR provably identifies an $\epsilon$-optimal policy with $\widetilde{O}(H^2/d_m\epsilon^2)$ episodes of offline data.
We also show that OPDVR achieves rate-optimal sample complexity under alternative settings.
arXiv Detail & Related papers (2021-02-02T20:47:35Z)
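The ReLU limit mentioned in the RePO entry above can be checked numerically. The sketch below is an illustration, not the RePO authors' code; the length-normalized margin and target margin $\gamma$ are assumptions borrowed from SimPO's formulation. It verifies that the scaled logistic loss $-\frac{1}{\beta}\log\sigma(\beta(m-\gamma))$ converges to $\mathrm{ReLU}(\gamma-m)$, so the logistic gradient weighting $\sigma(-\beta(m-\gamma))$ collapses to the binary threshold $1\{m<\gamma\}$.

```python
import numpy as np

def simpo_loss(margin, beta, gamma):
    """Logistic pairwise loss -log sigmoid(beta*(m - gamma)),
    computed as a numerically stable softplus."""
    return np.logaddexp(0.0, -beta * (margin - gamma))

def relu_loss(margin, gamma):
    """ReLU(gamma - m): the beta -> infinity limit of simpo_loss / beta."""
    return np.maximum(0.0, gamma - margin)

margins = np.linspace(-2.0, 2.0, 9)
for beta in (1.0, 10.0, 100.0):
    gap = np.max(np.abs(simpo_loss(margins, beta, 0.5) / beta
                        - relu_loss(margins, 0.5)))
    print(f"beta={beta:6.1f}  max gap={gap:.4f}")  # shrinks like log(2)/beta
```

The printed gap peaks at $m=\gamma$, where it equals $\log 2/\beta$, confirming the limiting behavior the RePO entry describes.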