Provably Efficient Neural Offline Reinforcement Learning via Perturbed
Rewards
- URL: http://arxiv.org/abs/2302.12780v1
- Date: Fri, 24 Feb 2023 17:52:12 GMT
- Title: Provably Efficient Neural Offline Reinforcement Learning via Perturbed
Rewards
- Authors: Thanh Nguyen-Tang, Raman Arora
- Abstract summary: VIPeR amalgamates the randomized value function idea with the pessimism principle.
It implicitly obtains pessimism by simply perturbing the offline data multiple times.
It is both provably and computationally efficient in general Markov decision processes (MDPs) with neural network function approximation.
- Score: 33.88533898709351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel offline reinforcement learning (RL) algorithm, namely
Value Iteration with Perturbed Rewards (VIPeR) which amalgamates the randomized
value function idea with the pessimism principle. Most current offline RL
algorithms explicitly construct statistical confidence regions to obtain
pessimism via lower confidence bounds (LCB), which cannot easily scale to
complex problems where a neural network is used to estimate the value
functions. Instead, VIPeR implicitly obtains pessimism by simply perturbing the
offline data multiple times with carefully designed i.i.d. Gaussian noise to
learn an ensemble of estimated state-action values, and then acting greedily
with respect to the minimum of the ensemble. The estimated state-action values
are obtained by fitting a parametric model (e.g., neural networks) to the perturbed datasets
using gradient descent. As a result, VIPeR only needs $\mathcal{O}(1)$ time
complexity for action selection while LCB-based algorithms require at least
$\Omega(K^2)$, where $K$ is the total number of trajectories in the offline
data. We also propose a novel data splitting technique that helps remove the
potentially large log covering number in the learning bound. We prove that
VIPeR yields a provable uncertainty quantifier with overparameterized neural
networks and achieves an $\tilde{\mathcal{O}}\left( \frac{ \kappa H^{5/2}
\tilde{d} }{\sqrt{K}} \right)$ sub-optimality where $\tilde{d}$ is the
effective dimension, $H$ is the horizon length and $\kappa$ measures the
distributional shift. We corroborate the statistical and computational
efficiency of VIPeR with an empirical evaluation on a wide range of synthetic and
real-world datasets. To the best of our knowledge, VIPeR is the first offline
RL algorithm that is both provably and computationally efficient in general
Markov decision processes (MDPs) with neural network function approximation.
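The mechanism described in the abstract (perturb the offline regression targets multiple times with i.i.d. Gaussian noise, fit a separate network to each perturbed copy by gradient descent, then act greedily with respect to the elementwise minimum of the ensemble) can be summarized in a short sketch. The code below is an illustrative reading of the abstract rather than the authors' implementation; the function names, MLP width, ensemble size M, noise scale sigma, and optimizer settings are all assumptions.

    # A minimal sketch of the perturbed-reward ensemble, assuming PyTorch.
    # All hyperparameters and names below are illustrative, not from the paper.
    import torch
    import torch.nn as nn

    def fit_q_ensemble(sa_inputs, targets, M=10, sigma=0.1, steps=200, lr=1e-2):
        """Fit M Q-networks, each on an independently perturbed copy of the targets.

        sa_inputs: (K, d) state-action features from the offline data.
        targets:   (K,)   regression targets, e.g. reward plus the next-step
                          pessimistic value in a value-iteration backup.
        """
        ensemble = []
        for _ in range(M):
            # Perturb the offline targets with i.i.d. Gaussian noise (one draw per sample).
            noisy = targets + sigma * torch.randn_like(targets)
            net = nn.Sequential(nn.Linear(sa_inputs.shape[-1], 64), nn.ReLU(),
                                nn.Linear(64, 1))
            opt = torch.optim.SGD(net.parameters(), lr=lr)
            for _ in range(steps):  # plain gradient descent on the perturbed dataset
                loss = ((net(sa_inputs).squeeze(-1) - noisy) ** 2).mean()
                opt.zero_grad()
                loss.backward()
                opt.step()
            ensemble.append(net)
        return ensemble

    def pessimistic_greedy_action(ensemble, state, candidate_actions):
        """Act greedily w.r.t. the elementwise minimum of the ensemble.

        Cost: M forward passes per candidate action, independent of the number
        of offline trajectories K.
        """
        sa = torch.cat([state.expand(candidate_actions.shape[0], -1),
                        candidate_actions], dim=-1)
        with torch.no_grad():
            q_all = torch.stack([net(sa).squeeze(-1) for net in ensemble])  # (M, A)
        return int(q_all.min(dim=0).values.argmax())

In the episodic setting, this fit would be repeated backward over the horizon h = H, ..., 1, with each step's pessimistic value feeding the next regression targets; the data-splitting refinement mentioned in the abstract is omitted here. Action selection queries only the M trained networks, which is the source of the $\mathcal{O}(1)$ dependence on the number of offline trajectories $K$.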
Related papers
- Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods [43.32546195968771]
We study the data-dependent convergence and generalization behavior of gradient methods for neural networks with smooth activation.
Our results address the shortcomings of the well-established Rademacher complexity-based bounds.
We show that a large step size significantly improves upon the NTK regime's results in classifying the XOR distribution.
arXiv Detail & Related papers (2024-10-13T21:49:29Z) - Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive Reinforcement Learning (RL).
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance the interplay between exploration, exploitation, and representation learning in CVaR RL.
We prove that our algorithm achieves a sample complexity bound for learning an $\epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of the action space, and $d$ is the dimension of representations.
arXiv Detail & Related papers (2023-11-20T17:44:40Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision
Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z) - Training Overparametrized Neural Networks in Sublinear Time [14.918404733024332]
Deep learning comes at a tremendous computational and energy cost.
We present a new and alternative view of deep neural networks as a collection of binary search trees, where each iteration corresponds to modifying a small subset of the nodes.
We believe this view will have further applications in the design and analysis of deep neural networks.
arXiv Detail & Related papers (2022-08-09T02:29:42Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Human-in-the-loop: Provably Efficient Preference-based Reinforcement
Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z) - On Reward-Free RL with Kernel and Neural Function Approximations:
Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem in the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z) - Randomized Exploration for Reinforcement Learning with General Value
Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noise; a short sketch contrasting this with VIPeR's pessimistic ensemble follows this list.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z)
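For contrast with the pessimism above, the randomized-exploration approach in the last entry fits a single value estimate to noise-perturbed training data and acts greedily with respect to that one randomized estimate, which is enough to drive exploration online. A minimal sketch under the same assumed conventions as the earlier block (names, noise scale, and model are illustrative, not the paper's algorithm):

    # Illustrative sketch of RLSVI-style randomized exploration, assuming PyTorch.
    import torch
    import torch.nn as nn

    def randomized_value_estimate(sa_inputs, targets, sigma=0.1, steps=200, lr=1e-2):
        """Fit one network to a single noise-perturbed copy of the regression targets."""
        noisy = targets + sigma * torch.randn_like(targets)  # i.i.d. scalar noise
        net = nn.Sequential(nn.Linear(sa_inputs.shape[-1], 64), nn.ReLU(),
                            nn.Linear(64, 1))
        opt = torch.optim.SGD(net.parameters(), lr=lr)
        for _ in range(steps):
            loss = ((net(sa_inputs).squeeze(-1) - noisy) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return net  # act greedily w.r.t. this randomized estimate each episode

A fresh perturbation each episode randomizes the greedy policy and induces exploration, whereas VIPeR draws many perturbations at once and takes their minimum, biasing the estimate downward to obtain pessimism for the fixed offline dataset.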
This list is automatically generated from the titles and abstracts of the papers on this site.