Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward
- URL: http://arxiv.org/abs/2206.06426v2
- Date: Wed, 19 Apr 2023 01:46:15 GMT
- Title: Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward
- Authors: Tengyu Xu, Yue Wang, Shaofeng Zou, Yingbin Liang
- Abstract summary: We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient for general MDPs with trajectory-wise reward.
- Score: 66.81579829897392
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The remarkable success of reinforcement learning (RL) heavily relies on
observing the reward of every visited state-action pair. In many real-world
applications, however, an agent can observe only a score that represents the
quality of the whole trajectory, which is referred to as the {\em
trajectory-wise reward}. In such a situation, it is difficult for standard RL
methods to make good use of the trajectory-wise reward, and policy evaluation
can incur large bias and variance errors. In this work, we propose a novel
offline RL algorithm, called Pessimistic vAlue iteRaTion with rEward
Decomposition (PARTED), which decomposes the trajectory return into per-step
proxy rewards via least-squares-based reward redistribution, and then performs
pessimistic value iteration based on the learned proxy reward. To ensure the
value functions constructed by PARTED are always pessimistic with respect to
the optimal ones, we design a new penalty term to offset the uncertainty of the
proxy reward. For general episodic MDPs with a large state space, we show that
PARTED with overparameterized neural network function approximation achieves an
$\tilde{\mathcal{O}}(D_{\text{eff}}H^2/\sqrt{N})$ suboptimality, where $H$ is
the episode length, $N$ is the total number of samples, and $D_{\text{eff}}$
is the effective dimension of the neural tangent kernel matrix. To further
illustrate the result, we show that PARTED achieves an
$\tilde{\mathcal{O}}(dH^3/\sqrt{N})$ suboptimality for linear MDPs, where $d$
is the feature dimension; this matches the bound under neural network function
approximation when $D_{\text{eff}}=dH$. To the best of our knowledge, PARTED is
the first offline RL algorithm that is provably efficient for general MDPs with
trajectory-wise reward.
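As a reading aid, the sketch below illustrates the two ingredients the abstract describes: least-squares reward redistribution and an uncertainty penalty on the learned proxy reward. It is a minimal sketch under simplifying assumptions, namely a known linear feature map and a single weight vector shared across steps, whereas the paper works with overparameterized neural networks and per-step reward functions; the names phi, trajectories, lam, and beta are hypothetical, not taken from the paper.

```python
# Minimal sketch, not the paper's implementation: linear features assumed.
import numpy as np

def redistribute_rewards(trajectories, phi, d, lam=1.0):
    """Least-squares reward redistribution: fit per-step proxy rewards whose
    sum over an episode matches the observed trajectory-wise return.

    trajectories: list of (states, actions, total_return), where states and
                  actions are length-H sequences and total_return is a float.
    phi:          feature map, phi(s, a) -> np.ndarray of shape (d,).
    """
    A = lam * np.eye(d)  # ridge-regularized Gram matrix
    b = np.zeros(d)
    for states, actions, total_return in trajectories:
        # Model the trajectory return as a sum of per-step rewards, so the
        # episode-level feature is the sum of the per-step features.
        psi = sum(phi(s, a) for s, a in zip(states, actions))
        A += np.outer(psi, psi)
        b += psi * total_return
    theta = np.linalg.solve(A, b)

    def proxy_reward(s, a):
        # Learned per-step proxy reward.
        return float(phi(s, a) @ theta)

    def uncertainty_penalty(s, a, beta=1.0):
        # An elliptical-potential-style bonus, beta * ||phi(s,a)||_{A^{-1}},
        # shown only to indicate where a penalty offsetting the proxy-reward
        # uncertainty would enter; it is not the paper's exact penalty term.
        f = phi(s, a)
        return float(beta * np.sqrt(f @ np.linalg.solve(A, f)))

    return proxy_reward, uncertainty_penalty
```

Pessimistic value iteration would then back up proxy_reward(s, a) - uncertainty_penalty(s, a) in place of the unobserved per-step reward.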
Related papers
- Uncertainty-Aware Reward-Free Exploration with General Function Approximation [69.27868448449755]
In this paper, we propose a reward-free reinforcement learning algorithm called GFA-RFE.
The key idea behind our algorithm is an uncertainty-aware intrinsic reward for exploring the environment.
Experiment results show that GFA-RFE outperforms or is comparable to the performance of state-of-the-art unsupervised RL algorithms.
arXiv Detail & Related papers (2024-06-24T01:37:18Z)
- Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning [17.239062061431646]
This paper studies reward-agnostic exploration in reinforcement learning (RL).
Consider a finite-horizon inhomogeneous Markov decision process with $S$ states, $A$ actions, and horizon length $H$.
Our algorithm is able to yield $\varepsilon$ accuracy for arbitrarily many reward functions.
arXiv Detail & Related papers (2023-04-14T17:46:49Z)
- Provably Efficient Neural Offline Reinforcement Learning via Perturbed Rewards [33.88533898709351]
VIPeR amalgamates the randomized value function idea with the pessimism principle.
It implicitly obtains pessimism by simply perturbing the offline data multiple times.
It is both provably and computationally efficient in general Markov decision processes (MDPs) with neural network function approximation.
arXiv Detail & Related papers (2023-02-24T17:52:12Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z)
- On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z)
- On Reward-Free Reinforcement Learning with Linear Function Approximation [144.4210285338698]
Reward-free reinforcement learning (RL) is a framework which is suitable for both the batch RL setting and the setting where there are many reward functions of interest.
In this work, we give both positive and negative results for reward-free RL with linear function approximation.
arXiv Detail & Related papers (2020-06-19T17:59:36Z)