PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback
- URL: http://arxiv.org/abs/2308.02585v3
- Date: Tue, 30 Apr 2024 21:05:21 GMT
- Title: PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback
- Authors: Souradip Chakraborty, Amrit Singh Bedi, Alec Koppel, Dinesh Manocha, Huazheng Wang, Mengdi Wang, Furong Huang,
- Abstract summary: We present a novel unified bilevel optimization-based framework, textsfPARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning.
Our framework addressed these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable.
Our empirical results substantiate that the proposed textsfPARL can address the alignment concerns in RL by showing significant improvements.
- Score: 106.63518036538163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel unified bilevel optimization-based framework, \textsf{PARL}, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning using utility or preference-based feedback. We identify a major gap within current algorithmic designs for solving policy alignment due to a lack of precise characterization of the dependence of the alignment objective on the data generated by policy trajectories. This shortfall contributes to the sub-optimal performance observed in contemporary algorithms. Our framework addressed these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable (optimal policy for the designed reward). Interestingly, from an optimization perspective, our formulation leads to a new class of stochastic bilevel problems where the stochasticity at the upper objective depends upon the lower-level variable. {True to our best knowledge, this work presents the first formulation of the RLHF as a bilevel optimization problem which generalizes the existing RLHF formulations and addresses the existing distribution shift issues in RLHF formulations.} To demonstrate the efficacy of our formulation in resolving alignment issues in RL, we devised an algorithm named \textsf{A-PARL} to solve PARL problem, establishing sample complexity bounds of order $\mathcal{O}(1/T)$. Our empirical results substantiate that the proposed \textsf{PARL} can address the alignment concerns in RL by showing significant improvements (up to 63\% in terms of required samples) for policy alignment in large-scale environments of the Deepmind control suite and Meta world tasks.
Related papers
- SAIL: Self-Improving Efficient Online Alignment of Large Language Models [56.59644677997827]
Reinforcement Learning from Human Feedback is a key method for aligning large language models with human preferences.
Recent literature has focused on designing online RLHF methods but still lacks a unified conceptual formulation.
Our approach significantly improves alignment performance on open-sourced datasets with minimal computational overhead.
arXiv Detail & Related papers (2024-06-21T18:05:35Z) - e-COP : Episodic Constrained Optimization of Policies [12.854752753529151]
We present the first policy optimization algorithm for constrained Reinforcement Learning (RL) in episodic (finite horizon) settings.
We show that our algorithm has similar or better performance than SoTA (non-episodic) algorithms adapted for the episodic setting.
arXiv Detail & Related papers (2024-06-13T20:12:09Z) - Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Robust Data-driven Prescriptiveness Optimization [4.792851066169871]
This paper introduces a distributionally robust contextual optimization model where the coefficient of prescriptiveness substitutes for the classical empirical risk objective minimization.
We evaluate the robustness of the resulting policies against alternative methods when the out-of-sample dataset is subject to varying amounts of distribution shift.
arXiv Detail & Related papers (2023-06-09T14:56:06Z) - Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - Constrained Combinatorial Optimization with Reinforcement Learning [0.30938904602244344]
This paper presents a framework to tackle constrained optimization problems using deep Reinforcement Learning (RL)
We extend the Neural Combinatorial Optimization (NCO) theory in order to deal with constraints in its formulation.
In that context, the solution is iteratively constructed based on interactions with the environment.
arXiv Detail & Related papers (2020-06-22T03:13:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.