Human-in-the-loop: Provably Efficient Preference-based Reinforcement
Learning with General Function Approximation
- URL: http://arxiv.org/abs/2205.11140v2
- Date: Tue, 24 May 2022 02:42:19 GMT
- Title: Human-in-the-loop: Provably Efficient Preference-based Reinforcement
Learning with General Function Approximation
- Authors: Xiaoyu Chen, Han Zhong, Zhuoran Yang, Zhaoran Wang, Liwei Wang
- Abstract summary: We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
- Score: 107.54516740713969
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study human-in-the-loop reinforcement learning (RL) with trajectory
preferences, where instead of receiving a numeric reward at each step, the
agent only receives preferences over trajectory pairs from a human overseer.
The goal of the agent is to learn the optimal policy which is most preferred by
the human overseer. Despite the empirical successes, the theoretical
understanding of preference-based RL (PbRL) is only limited to the tabular
case. In this paper, we propose the first optimistic model-based algorithm for
PbRL with general function approximation, which estimates the model using
value-targeted regression and calculates the exploratory policies by solving an
optimistic planning problem. Our algorithm achieves the regret of $\tilde{O}
(\operatorname{poly}(d H) \sqrt{K} )$, where $d$ is the complexity measure of
the transition and preference model depending on the Eluder dimension and
log-covering numbers, $H$ is the planning horizon, $K$ is the number of
episodes, and $\tilde O(\cdot)$ omits logarithmic terms. Our lower bound
indicates that our algorithm is near-optimal when specialized to the linear
setting. Furthermore, we extend the PbRL problem by formulating a novel problem
called RL with $n$-wise comparisons, and provide the first sample-efficient
algorithm for this new setting. To the best of our knowledge, this is the first
theoretical result for PbRL with (general) function approximation.
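For intuition about the preference-feedback component described above, here is a minimal sketch that fits a trajectory-level reward parameter from pairwise comparisons. It assumes a linear Bradley-Terry preference model and a plain gradient-ascent MLE; the feature map phi, the linear/logistic form, and the function names are illustrative assumptions rather than the paper's general-function-approximation construction.
```python
import numpy as np

# Minimal sketch (assumption): trajectories are summarized by feature vectors
# phi(tau), and the overseer's preferences follow a linear Bradley-Terry model
#   P(tau1 preferred over tau2) = sigmoid(<theta, phi(tau1) - phi(tau2)>).
# The paper treats general function approximation; this linear special case is
# only for illustration.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_preference_model(phi1, phi2, prefs, lr=0.5, iters=2000):
    """Maximum-likelihood estimate of theta from K pairwise comparisons.

    phi1, phi2 : (K, d) trajectory features for each compared pair (tau1, tau2).
    prefs      : (K,) binary labels, 1 if tau1 was preferred, else 0.
    """
    diff = phi1 - phi2                                  # phi(tau1) - phi(tau2)
    theta = np.zeros(phi1.shape[1])
    for _ in range(iters):
        p = sigmoid(diff @ theta)                       # predicted preference prob.
        theta += lr * diff.T @ (prefs - p) / len(prefs) # log-likelihood gradient step
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, K = 5, 500
    theta_star = rng.normal(size=d)
    phi1 = rng.normal(size=(K, d))
    phi2 = rng.normal(size=(K, d))
    prefs = (rng.random(K) < sigmoid((phi1 - phi2) @ theta_star)).astype(float)
    theta_hat = fit_preference_model(phi1, phi2, prefs)
    cos = theta_hat @ theta_star / (
        np.linalg.norm(theta_hat) * np.linalg.norm(theta_star))
    print(f"cosine similarity between estimate and true theta: {cos:.3f}")
```
In the full algorithm described in the abstract, an estimate of this kind would sit alongside a transition model fitted by value-targeted regression, with exploratory policies chosen by solving an optimistic planning problem over the resulting confidence sets.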
Related papers
- Offline Primal-Dual Reinforcement Learning for Linear MDPs [16.782625445546273]
Offline Reinforcement Learning (RL) aims to learn a near-optimal policy from a fixed dataset of transitions collected by another policy.
This paper proposes a primal-dual optimization method based on the linear programming formulation of RL.
arXiv Detail & Related papers (2023-05-22T11:45:23Z)
- Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z)
- Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z)
- VO$Q$L: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation [43.193807443491814]
We study time-inhomogeneous episodic reinforcement learning (RL) under general function approximation and sparse rewards.
We design a new algorithm, Variance-weighted Optimistic $Q$-Learning (VO$Q$L), based on $Q$-learning, and bound its regret assuming completeness and bounded Eluder dimension for the regression function class.
arXiv Detail & Related papers (2022-12-12T17:37:00Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards; a minimal sketch of this redistribution step appears after this list.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient for general MDPs with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z)
- Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension [124.7752517531109]
We establish a provably efficient reinforcement learning algorithm with general value function approximation.
We show that our algorithm achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$, where $d$ is a complexity measure.
Our theory generalizes recent progress on RL with linear value function approximation and does not make explicit assumptions on the model of the environment.
arXiv Detail & Related papers (2020-05-21T17:36:09Z)
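As noted in the PARTED entry above, the following is a minimal sketch of least-squares reward redistribution under the simplifying assumption that the per-step proxy reward is linear, $\hat{r}(s,a) = \langle w, \phi(s,a) \rangle$: each trajectory's return is regressed on the sum of its per-step features, and the fitted $w$ yields per-step proxy rewards. The linear form, ridge regularization, and names are illustrative assumptions, not PARTED's exact construction.
```python
import numpy as np

# Minimal sketch (assumption): per-step proxy reward r_hat(s, a) = <w, phi(s, a)>.
# Least-squares redistribution: regress each trajectory's total return on the
# sum of its per-step features, then read per-step proxy rewards off w.

def redistribute_rewards(step_features, returns, ridge=1e-3):
    """step_features : (N, H, d) per-step features for N trajectories of length H.
    returns          : (N,) observed trajectory-wise returns.
    Returns w such that sum_h <w, phi_h> approximates each trajectory return.
    """
    summed = step_features.sum(axis=1)                      # (N, d)
    d = summed.shape[1]
    w = np.linalg.solve(summed.T @ summed + ridge * np.eye(d),
                        summed.T @ returns)                 # ridge least squares
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    N, H, d = 200, 10, 4
    w_star = rng.normal(size=d)
    phi = rng.normal(size=(N, H, d))
    returns = (phi @ w_star).sum(axis=1)   # only trajectory-wise returns observed
    w_hat = redistribute_rewards(phi, returns)
    proxy = phi @ w_hat                    # recovered per-step proxy rewards
    print("max abs error in w:", float(np.abs(w_hat - w_star).max()))
```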