Off-Policy Evaluation for Human Feedback
- URL: http://arxiv.org/abs/2310.07123v2
- Date: Sat, 14 Oct 2023 16:38:00 GMT
- Title: Off-Policy Evaluation for Human Feedback
- Authors: Qitong Gao, Ge Gao, Juncheng Dong, Vahid Tarokh, Min Chi, Miroslav
Pajic
- Abstract summary: Off-policy evaluation (OPE) is important for closing the gap between offline training and evaluation of reinforcement learning (RL).
Existing OPE methods fall short in estimating human feedback (HF) signals.
We introduce an OPE for HF (OPEHF) framework that revives existing OPE methods in order to accurately evaluate the HF signals.
- Score: 46.82894469763776
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Off-policy evaluation (OPE) is important for closing the gap between offline
training and evaluation of reinforcement learning (RL), by estimating
performance and/or rank of target (evaluation) policies using offline
trajectories only. It can improve the safety and efficiency of data collection
and policy testing procedures in situations where online deployments are
expensive, such as healthcare. However, existing OPE methods fall short in
estimating human feedback (HF) signals, as HF may be conditioned on multiple
underlying factors and is only sparsely available, as opposed to the
agent-defined environmental rewards (used in policy optimization), which are
usually determined by parametric functions or distributions. Consequently, the
nature of HF signals makes extrapolating accurate OPE estimates challenging. To
resolve this, we introduce an OPE for HF (OPEHF) framework that
revives existing OPE methods in order to accurately evaluate the HF signals.
Specifically, we develop an immediate human reward (IHR) reconstruction
approach, regularized by environmental knowledge distilled in a latent space
that captures the underlying dynamics of state transitions as well as the
issuing of HF signals. Our approach has been tested in two real-world
experiments, adaptive in-vivo neurostimulation and intelligent tutoring, as
well as in a simulation environment (visual Q&A). Results show that our
approach significantly improves accuracy in estimating HF signals,
compared to directly applying (variants of) existing OPE methods.
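The abstract describes the OPEHF pipeline only at a high level. Purely as an illustration of that description (not the authors' implementation; every module, function, and variable name below is hypothetical), the following sketch shows one way an immediate human reward (IHR) reconstruction head could be trained jointly with a latent dynamics model, so that a sparse, episode-level HF signal is redistributed into per-step rewards:

```python
# Hypothetical sketch of IHR reconstruction regularized by a latent dynamics
# model, loosely following the abstract's description; not the authors' code.
import torch
import torch.nn as nn

class LatentDynamicsIHR(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=32):
        super().__init__()
        # Encoder distills environmental knowledge into a latent space.
        self.encoder = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        # Latent transition head captures state-transition dynamics.
        self.transition = nn.Linear(latent_dim + action_dim, latent_dim)
        # IHR head reconstructs an immediate (per-step) human reward.
        self.ihr_head = nn.Linear(latent_dim + action_dim, 1)

    def forward(self, states, actions):
        z = self.encoder(states)
        za = torch.cat([z, actions], dim=-1)
        return self.transition(za), self.ihr_head(za).squeeze(-1)

def opehf_style_loss(model, states, actions, next_states, hf_episode, alpha=1.0):
    """hf_episode: a single sparse HF signal observed for the whole trajectory."""
    z_next_pred, ihr = model(states, actions)
    with torch.no_grad():
        z_next = model.encoder(next_states)
    dynamics_loss = ((z_next_pred - z_next) ** 2).mean()   # environmental regularizer
    recon_loss = (ihr.sum() - hf_episode) ** 2             # reconstruct sparse HF
    return recon_loss + alpha * dynamics_loss
```

Once per-step rewards have been reconstructed in some such way, off-the-shelf OPE estimators (e.g., importance sampling or fitted Q-evaluation) can be applied to the densified trajectories, which is one way to read the abstract's claim of "reviving" existing OPE methods.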
Related papers
- Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization [23.817251267022847]
We propose the Behavior-Supported Policy Optimization (BSPO) method to mitigate the reward over-optimization issue.
BSPO reduces the generation of OOD responses during the reinforcement learning process.
Empirical results show that BSPO outperforms baselines in preventing reward over-optimization.
arXiv Detail & Related papers (2025-03-23T16:20:59Z)
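As a loose illustration of the behavior-supported idea summarized above (not the BSPO algorithm itself; the threshold, penalty value, and function name are hypothetical), one can clamp the reward of actions that the behavior policy barely supports, which discourages OOD generations during RL:

```python
# Hypothetical sketch: replace the reward of poorly supported (likely OOD)
# actions with a penalty value; illustrative only, not the BSPO method.
import torch

def behavior_supported_reward(reward, behavior_logprob, threshold=-10.0, penalty=-1.0):
    """Keep the learned reward only where the behavior policy supports the action."""
    supported = behavior_logprob >= threshold
    return torch.where(supported, reward, torch.full_like(reward, penalty))
```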
- Provably Efficient RLHF Pipeline: A Unified View from Contextual Bandits [59.30310692855397]
We propose a unified framework for the RLHF pipeline from the view of contextual bandits.
We decompose the RLHF process into two distinct stages: (post-)training and deployment.
We then develop novel algorithms for each stage, demonstrating significant improvements in both statistical and computational efficiency.
arXiv Detail & Related papers (2025-02-11T02:36:01Z)
- Adaptive Dense Reward: Understanding the Gap Between Action and Reward Space in Alignment [33.5805074836187]
Reinforcement Learning from Human Feedback (RLHF) has proven highly effective in aligning Large Language Models (LLMs) with human preferences.
However, its effectiveness is limited by RLHF's lack of awareness regarding which specific tokens should be reinforced or suppressed.
We propose the Adaptive Message-wise RLHF method, which robustly applies to various tasks.
arXiv Detail & Related papers (2024-10-23T16:16:15Z)
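As a generic illustration of the token-level credit problem described above (not the paper's Adaptive Message-wise method; the weighting scheme and names are hypothetical), a single sequence-level reward can be redistributed over tokens or message segments so that credit reaches the tokens that should be reinforced or suppressed:

```python
# Hypothetical sketch: spread one sequence-level reward over tokens using
# arbitrary non-negative per-token weights; illustrative only.
import torch

def dense_token_rewards(sequence_reward: float, token_scores: torch.Tensor) -> torch.Tensor:
    """token_scores: non-negative per-token weights with shape (seq_len,)."""
    weights = token_scores / token_scores.sum().clamp_min(1e-8)
    return sequence_reward * weights
```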
- Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single-stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
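As a generic single-objective sketch of the integration described above (not the AIHF objective itself; the mixing weight and names are hypothetical), a demonstration (behavior-cloning) term can be combined with a Bradley-Terry preference term instead of training the reward model and policy in separate stages:

```python
# Hypothetical sketch: one loss mixing demonstration likelihood and pairwise
# preference terms; illustrative only, not the AIHF formulation.
import torch
import torch.nn.functional as F

def integrated_feedback_loss(logp_demo, reward_chosen, reward_rejected, lam=0.5):
    """logp_demo: policy log-likelihoods of demonstration actions;
    reward_chosen / reward_rejected: scores of preferred / dispreferred responses."""
    demo_loss = -logp_demo.mean()
    pref_loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    return lam * demo_loss + (1.0 - lam) * pref_loss
```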
- Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards [26.40009657912622]
Reinforcement learning from human feedback (RLHF) is the mainstream paradigm used to align large language models (LLMs) with human preferences.
Yet existing RLHF heavily relies on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources.
In this work, we improve the effectiveness of the reward model by introducing a penalty term on the reward, named contrastive rewards.
arXiv Detail & Related papers (2024-03-12T14:51:57Z)
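One rough reading of the penalty term mentioned above (not necessarily the paper's exact formulation; names are hypothetical) is to center the raw reward against a baseline computed from reference responses for the same prompt, so that only the relative improvement is reinforced:

```python
# Hypothetical sketch: subtract a baseline built from reference responses to
# obtain a "contrastive" reward; illustrative only.
import torch

def contrastive_reward(reward: torch.Tensor, baseline_rewards: torch.Tensor) -> torch.Tensor:
    """reward: (batch,); baseline_rewards: (batch, k) rewards of k reference responses."""
    return reward - baseline_rewards.mean(dim=1)
```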
- Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization [56.54271464134885]
We consider an RLHF algorithm based on policy optimization (PO-RLHF).
We provide performance bounds for PO-RLHF with low query complexity.
Key novelty is a trajectory-level elliptical potential analysis.
arXiv Detail & Related papers (2024-02-15T22:11:18Z)
- Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF).
We first identify the primary challenge of existing popular methods, such as offline PPO and offline DPO, as their lack of strategic exploration of the environment.
We propose efficient algorithms with finite-sample theoretical guarantees.
arXiv Detail & Related papers (2023-12-18T18:58:42Z)
- Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks [58.469818546042696]
We study the sample efficiency of OPE with human preference and establish a statistical guarantee for it.
By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2023-10-16T16:27:06Z)
- Hindsight-DICE: Stable Credit Assignment for Deep Reinforcement Learning [11.084321518414226]
We adapt existing importance-sampling ratio estimation techniques for off-policy evaluation to drastically improve the stability and efficiency of so-called hindsight policy methods.
Our hindsight distribution correction facilitates stable, efficient learning across a broad range of environments where credit assignment plagues baseline methods.
arXiv Detail & Related papers (2023-07-21T20:54:52Z)
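As a generic illustration of the hindsight distribution correction summarized above (not the Hindsight-DICE estimator; the names are hypothetical), returns can be reweighted by a ratio between an outcome-conditioned (hindsight) action distribution and the policy's own distribution:

```python
# Hypothetical sketch: hindsight importance ratio used to reweight returns for
# credit assignment; illustrative only, not the paper's estimator.
import torch

def hindsight_weighted_credit(returns, logp_hindsight, logp_policy):
    """Log-probabilities of the taken actions under the outcome-conditioned
    (hindsight) model and under the policy; all tensors of shape (batch,)."""
    ratio = torch.exp(logp_hindsight - logp_policy)
    return ratio * returns
```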
- Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions [48.91284724066349]
Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education.
Traditional measures such as confidence intervals may be insufficient due to noise, limited data and confounding.
We develop a method that could serve as a hybrid human-AI system, to enable human experts to analyze the validity of policy evaluation estimates.
arXiv Detail & Related papers (2020-02-10T00:26:43Z)
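As a crude illustration of highlighting influential transitions (not the paper's influence analysis; the leave-one-out scheme and names are hypothetical), each logged transition can be scored by how much the OPE estimate changes when that transition is removed and the estimator is refit:

```python
# Hypothetical sketch: leave-one-out influence of each logged transition on an
# OPE estimate; fit_and_estimate is a user-supplied OPE routine.
def transition_influences(transitions, fit_and_estimate):
    """transitions: list of (s, a, r, s_next) tuples; fit_and_estimate: callable
    that returns a scalar value estimate from a list of transitions."""
    full_estimate = fit_and_estimate(transitions)
    influences = []
    for i in range(len(transitions)):
        loo = transitions[:i] + transitions[i + 1:]
        influences.append(abs(fit_and_estimate(loo) - full_estimate))
    return influences
```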