Offline Reinforcement Learning for Human-Guided Human-Machine
Interaction with Private Information
- URL: http://arxiv.org/abs/2212.12167v1
- Date: Fri, 23 Dec 2022 06:26:44 GMT
- Title: Offline Reinforcement Learning for Human-Guided Human-Machine
Interaction with Private Information
- Authors: Zuyue Fu, Zhengling Qi, Zhuoran Yang, Zhaoran Wang, Lan Wang
- Abstract summary: We study human-guided human-machine interaction involving private information.
We focus on offline reinforcement learning (RL) in this game.
We develop a novel identification result and use it to propose a new off-policy evaluation method.
- Score: 110.42866062614912
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by human-machine interaction applications such as training
chatbots to improve customer satisfaction, we study human-guided human-machine
interaction involving private information. We model this interaction as a
two-player turn-based game, where one player (Alice, a human) guides the other
player (Bob, a machine) towards a common goal. Specifically, we focus on
offline reinforcement learning (RL) in this game, where the goal is to find a
policy pair for Alice and Bob that maximizes their expected total rewards based
on an offline dataset collected a priori. The offline setting presents two
challenges: (i) we cannot collect Bob's private information, which leads to
confounding bias when standard RL methods are applied, and (ii) there is a
distributional mismatch between the behavior policy used to collect the data
and the desired policy we aim to learn. To tackle the confounding bias, we treat Bob's previous
action as an instrumental variable for Alice's current decision making so as to
adjust for the unmeasured confounding. We develop a novel identification result
and use it to propose a new off-policy evaluation (OPE) method for evaluating
policy pairs in this two-player turn-based game. To tackle the distributional
mismatch, we leverage the idea of pessimism and use our OPE method to develop
an off-policy learning algorithm for finding a desirable policy pair for both
Alice and Bob. Finally, we prove that under mild assumptions such as partial
coverage of the offline data, the policy pair obtained through our method
converges to the optimal one at a satisfactory rate.
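To make the instrumental-variable idea concrete, here is a minimal single-step toy in a linear model: a hidden confounder (standing in for Bob's private information) biases the naive estimate of how Alice's action affects the reward, while an instrument (standing in for Bob's previous action) recovers the causal effect. This is only an illustrative sketch, not the paper's sequential identification result, its OPE estimator, or its pessimistic learning algorithm; every variable and coefficient below is made up.

```python
# Toy single-step illustration of instrumental-variable adjustment (not the
# paper's estimator). "z" plays the role of Bob's previous action (instrument),
# "u" the role of Bob's private information (unmeasured confounder), "a" is
# Alice's action, and "r" the reward. All coefficients are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

u = rng.normal(size=n)                     # hidden confounder (never recorded)
z = rng.normal(size=n)                     # instrument: affects a, not r directly
a = z + u + rng.normal(scale=0.5, size=n)  # Alice's action, confounded by u

true_effect = 2.0                          # causal effect of a on r
r = true_effect * a + 3.0 * u + rng.normal(scale=0.5, size=n)

# Naive regression of r on a is biased because u is unobserved.
naive = np.cov(a, r)[0, 1] / np.var(a)

# IV (Wald / two-stage least squares with one instrument): the instrument's
# association with r flows entirely through a, so the ratio recovers the
# causal effect despite the unmeasured confounding.
iv = np.cov(z, r)[0, 1] / np.cov(z, a)[0, 1]

print(f"naive estimate: {naive:.2f}  (biased away from {true_effect})")
print(f"IV estimate:    {iv:.2f}  (close to {true_effect})")
```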
Related papers
- Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning [55.65738319966385]
We propose a novel online algorithm, iterative Nash policy optimization (INPO)
Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses.
With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard.
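As a rough, hedged illustration of the no-regret-learning viewpoint only (this is not INPO's objective or update rule), the sketch below runs a standard multiplicative-weights update over a tiny, invented set of candidate responses, using only comparisons against a sampled opponent rather than any estimated per-response expected win rate.

```python
# Hedged sketch, NOT INPO's actual loss or update: a standard multiplicative-
# weights (Hedge) no-regret update over a tiny invented set of candidate
# responses, driven by comparisons against a sampled opponent rather than any
# estimated per-response expected win rate.
import numpy as np

rng = np.random.default_rng(1)
responses = ["resp_a", "resp_b", "resp_c"]       # hypothetical candidates
true_pref = np.array([[0.5, 0.7, 0.9],           # true_pref[i, j] = P(i beats j)
                      [0.3, 0.5, 0.8],
                      [0.1, 0.2, 0.5]])

weights = np.ones(len(responses))
eta = 0.5                                        # learning rate

for _ in range(500):
    policy = weights / weights.sum()
    # Self-play style: sample an opponent response from the current policy and
    # credit each candidate with its win probability against that opponent.
    opp = rng.choice(len(responses), p=policy)
    weights *= np.exp(eta * true_pref[:, opp])   # Hedge / mirror-descent step
    weights /= weights.sum()                     # renormalize for stability

print("final policy:", np.round(weights, 3))
# Mass concentrates on resp_a, which beats the other responses most often.
```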
arXiv Detail & Related papers (2024-06-30T08:00:34Z)
- Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism [91.52263068880484]
We study offline Reinforcement Learning with Human Feedback (RLHF)
We aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices.
RLHF is challenging for multiple reasons: large state space but limited human feedback, the bounded rationality of human decisions, and the off-policy distribution shift.
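A minimal sketch of one ingredient mentioned here, recovering a reward from noisy human choices: a linear reward is fit by maximum likelihood under a logit (Bradley-Terry-style) choice model, a common stand-in for bounded rationality. This is a one-shot toy with synthetic data, not the paper's dynamic-choice estimator or its pessimistic policy-learning step.

```python
# Hedged sketch, not the paper's estimator: fit a linear reward from noisy
# ("boundedly rational", logit / Bradley-Terry) human choices between pairs of
# options by maximum likelihood. Features, sample sizes, and the data generator
# are all synthetic assumptions.
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 5_000
theta_true = rng.normal(size=d)                  # unknown human reward weights

phi_a = rng.normal(size=(n, d))                  # features of option A
phi_b = rng.normal(size=(n, d))                  # features of option B
logits = (phi_a - phi_b) @ theta_true
choice = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(float)  # 1 = chose A

theta = np.zeros(d)
for _ in range(200):                             # gradient ascent on the log-likelihood
    p = 1.0 / (1.0 + np.exp(-(phi_a - phi_b) @ theta))
    theta += 0.5 * (phi_a - phi_b).T @ (choice - p) / n

print("true weights:     ", np.round(theta_true, 2))
print("recovered weights:", np.round(theta, 2))
```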
arXiv Detail & Related papers (2023-05-29T01:18:39Z)
- Beyond Reward: Offline Preference-guided Policy Optimization [18.49648170835782]
Offline preference-based reinforcement learning (PbRL) is a variant of conventional reinforcement learning that dispenses with the need for online interaction.
This study focuses on offline preference-guided policy optimization (OPPO).
OPPO models offline trajectories and preferences in a one-step process, eliminating the need for separately learning a reward function.
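As a generic, hedged illustration of avoiding a separately learned reward function (this is not OPPO's actual one-step algorithm), the sketch below simply behavior-clones the preferred trajectory of each labeled pair; the synthetic random data and the linear softmax policy are assumptions made for brevity, so the point is the pipeline's structure rather than any learned signal.

```python
# Hedged sketch, not OPPO's one-step algorithm: use preference labels without an
# explicit reward model by behavior-cloning only the preferred trajectory of
# each labeled pair. The random synthetic data carries no real signal; shapes
# and the linear softmax policy are assumptions.
import numpy as np

rng = np.random.default_rng(3)
horizon, d_state, n_actions, n_pairs = 10, 3, 4, 300

def synthetic_trajectory():
    states = rng.normal(size=(horizon, d_state))
    actions = rng.integers(n_actions, size=horizon)
    return states, actions

# Each item: (trajectory_0, trajectory_1, index of the preferred trajectory).
dataset = [(synthetic_trajectory(), synthetic_trajectory(), int(rng.integers(2)))
           for _ in range(n_pairs)]

# Keep (state, action) pairs from the preferred trajectories only.
S, A = [], []
for traj0, traj1, pref in dataset:
    states, actions = traj0 if pref == 0 else traj1
    S.append(states)
    A.append(actions)
S, A = np.concatenate(S), np.concatenate(A)

# Linear softmax policy trained by cross-entropy (plain gradient descent).
W = np.zeros((d_state, n_actions))
for _ in range(300):
    logits = S @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    W -= 0.5 * S.T @ (probs - np.eye(n_actions)[A]) / len(A)

print("behavior-cloned policy weights:", np.round(W, 3))
```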
arXiv Detail & Related papers (2023-05-25T16:24:11Z)
- Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
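For context, the display below recalls the standard concentrability notions that this line of work refines; the paper's exact coefficient is not reproduced here, only the abstract's statement that it is upper bounded by the per-trajectory ratio.

```latex
% Standard (not paper-specific) single-policy concentrability notions:
% state-action level and trajectory level, for a target policy \pi and a
% behavior (data-collection) distribution \mu.
\[
  C^{\pi}_{\mathrm{sa}} \;:=\; \sup_{s,a} \frac{d^{\pi}(s,a)}{\mu(s,a)},
  \qquad
  C^{\pi}_{\mathrm{traj}} \;:=\; \sup_{\tau} \frac{\Pr^{\pi}(\tau)}{\Pr^{\mu}(\tau)}.
\]
% Per the abstract, the paper's new single-policy coefficient is upper bounded
% by the per-trajectory quantity C^{\pi}_{\mathrm{traj}}.
```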
arXiv Detail & Related papers (2023-05-24T07:11:26Z)
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids merely learning mediocre behavior on mixed datasets but is also competitive with state-of-the-art offline RL methods.
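A simplified sketch of the experience-picking idea, under assumptions and not COIL's exact selection rule: keep only trajectories that the current policy already scores as likely (a crude proxy for "neighboring" behavior) and whose return exceeds the current policy's estimated return, then imitate that subset. The toy linear likelihood score and the synthetic trajectories below are placeholders.

```python
# Hedged sketch of the experience-picking idea, not COIL's exact rule: keep
# trajectories that (a) the current policy already scores as likely and
# (b) achieve a return above the current policy's estimate, then imitate them.
import numpy as np

rng = np.random.default_rng(4)

def log_likelihood(policy, traj):
    """Toy stand-in: score of a trajectory under the current policy.
    A real implementation would evaluate the policy's action probabilities."""
    return float(policy @ traj["summary"])

trajectories = [{"summary": rng.normal(size=3), "return": float(rng.normal())}
                for _ in range(100)]
policy = rng.normal(size=3)                      # toy "current policy"
current_return = 0.0                             # estimated return of the current policy

scores = np.array([log_likelihood(policy, t) for t in trajectories])
threshold = np.quantile(scores, 0.8)             # "neighboring" = top 20% by score

picked = [t for t, s in zip(trajectories, scores)
          if s >= threshold and t["return"] > current_return]
print(f"picked {len(picked)} of {len(trajectories)} trajectories for the next imitation round")
```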
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
- Offline Preference-Based Apprenticeship Learning [11.21888613165599]
We study how an offline dataset can be used to address two challenges that autonomous systems face when they endeavor to learn from, adapt to, and collaborate with humans.
First, we use the offline dataset to efficiently infer the human's reward function via pool-based active preference learning.
Second, given this learned reward function, we perform offline reinforcement learning to optimize a policy based on the inferred human intent.
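A minimal sketch of pool-based active preference learning under assumptions: queries are chosen by uncertainty sampling (the pair whose predicted preference is closest to 50/50), a common acquisition rule but not necessarily the paper's, and the simulated human answers from a hidden Bradley-Terry reward. The offline RL stage that would follow is omitted.

```python
# Hedged sketch: pool-based active preference learning with uncertainty
# sampling. Features, the simulated human, and the acquisition rule are
# assumptions; the subsequent offline RL stage is not included.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
n_traj, d = 30, 4
features = rng.normal(size=(n_traj, d))        # per-trajectory feature vectors
theta_true = rng.normal(size=d)                # hidden human reward weights
theta = np.zeros(d)                            # current reward estimate

def pref_prob(w, i, j):
    """P(trajectory i is preferred to trajectory j) under weights w."""
    return 1.0 / (1.0 + np.exp(-(features[i] - features[j]) @ w))

for _ in range(60):                            # 60 preference queries
    # Acquisition: the pair the current model is least certain about.
    i, j = min(combinations(range(n_traj), 2),
               key=lambda p: abs(pref_prob(theta, *p) - 0.5))
    label = float(rng.random() < pref_prob(theta_true, i, j))   # simulated human
    # One stochastic gradient step on the Bradley-Terry log-likelihood.
    theta += (label - pref_prob(theta, i, j)) * (features[i] - features[j])

cos = theta @ theta_true / (np.linalg.norm(theta) * np.linalg.norm(theta_true))
print(f"cosine similarity of estimated vs. true reward weights: {cos:.2f}")
```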
arXiv Detail & Related papers (2021-07-20T04:15:52Z)
- Offline Reinforcement Learning as Anti-Exploration [49.72457136766916]
We take inspiration from the literature on bonus-based exploration to design a new offline RL agent.
The core idea is to subtract a prediction-based exploration bonus from the reward, instead of adding it for exploration.
We show that our agent is competitive with the state of the art on a set of continuous control locomotion and manipulation tasks.
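A minimal sketch of the bonus-subtraction idea under simplifying assumptions: an RND-style novelty bonus (prediction error against a fixed random target, here with tiny linear models rather than neural networks) is subtracted from the reward, so state-action features unlike the offline data are penalized rather than encouraged. Shapes, data, and the least-squares fit are assumptions, not the paper's agent.

```python
# Hedged sketch of the anti-exploration idea: an RND-style novelty bonus
# (prediction error against a fixed random nonlinear target) is SUBTRACTED
# from the reward, so features far from the offline data get penalized.
import numpy as np

rng = np.random.default_rng(6)
d = 8                                               # state-action feature dim

target_W = rng.normal(size=(d, d))
def target(x):                                      # fixed random nonlinear target
    return np.tanh(x @ target_W)

offline_features = rng.normal(size=(5_000, d))      # stand-in offline dataset
# "Predictor" fit on the offline data only (least squares here for brevity).
pred_W, *_ = np.linalg.lstsq(offline_features, target(offline_features), rcond=None)

def penalized_reward(features, reward, scale=1.0):
    """Environment reward minus the prediction-error novelty bonus."""
    bonus = np.sum((features @ pred_W - target(features)) ** 2, axis=-1)
    return reward - scale * bonus

in_dist = rng.normal(size=(3, d))                   # looks like the offline data
out_dist = rng.normal(size=(3, d)) * 5.0            # far from the offline data
print("penalized reward (r=0), in-distribution:   ", penalized_reward(in_dist, 0.0).round(2))
print("penalized reward (r=0), out-of-distribution:", penalized_reward(out_dist, 0.0).round(2))
```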
arXiv Detail & Related papers (2021-06-11T14:41:30Z)