Bridging RL Theory and Practice with the Effective Horizon
- URL: http://arxiv.org/abs/2304.09853v3
- Date: Thu, 11 Jan 2024 22:51:24 GMT
- Title: Bridging RL Theory and Practice with the Effective Horizon
- Authors: Cassidy Laidlaw and Stuart Russell and Anca Dragan
- Abstract summary: We show that prior bounds do not correlate well with when deep RL succeeds vs. fails, but a simple property of the random policy's Q-values does.
We generalize this property into a new complexity measure of an MDP that we call the effective horizon.
We also find that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy.
- Score: 18.706109961534676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep reinforcement learning (RL) works impressively in some environments and
fails catastrophically in others. Ideally, RL theory should be able to provide
an understanding of why this is, i.e. bounds predictive of practical
performance. Unfortunately, current theory does not quite have this ability. We
compare standard deep RL algorithms to prior sample complexity bounds by
introducing a new dataset, BRIDGE. It consists of 155 deterministic MDPs from
common deep RL benchmarks, along with their corresponding tabular
representations, which enables us to exactly compute instance-dependent bounds.
We choose to focus on deterministic environments because they share many
interesting properties of stochastic environments, but are easier to analyze.
Using BRIDGE, we find that prior bounds do not correlate well with when deep RL
succeeds vs. fails, but discover a surprising property that does. When actions
with the highest Q-values under the random policy also have the highest
Q-values under the optimal policy (i.e. when it is optimal to be greedy on the
random policy's Q function), deep RL tends to succeed; when they don't, deep RL
tends to fail. We generalize this property into a new complexity measure of an
MDP that we call the effective horizon, which roughly corresponds to how many
steps of lookahead search would be needed in that MDP in order to identify the
next optimal action, when leaf nodes are evaluated with random rollouts. Using
BRIDGE, we show that the effective horizon-based bounds are more closely
reflective of the empirical performance of PPO and DQN than prior sample
complexity bounds across four metrics. We also find that, unlike existing
bounds, the effective horizon can predict the effects of using reward shaping
or a pre-trained exploration policy. Our code and data are available at
https://github.com/cassidylaidlaw/effective-horizon
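The property described above is concrete enough to check directly in the kind of tabular representations BRIDGE provides: compute the Q-function of the uniform-random policy and the optimal Q-function, and test whether acting greedily on the former is always optimal. The sketch below is a minimal illustration of that check, not the authors' implementation (their code is in the linked repository); it assumes a small deterministic, finite-horizon tabular MDP given as next-state and reward arrays, and the function names are illustrative.

```python
import numpy as np

def q_random_and_optimal(T, R, H):
    """Finite-horizon Q-values of the uniform-random policy and of the optimal
    policy for a small deterministic tabular MDP.

    T: (S, A) int array of deterministic next states.
    R: (S, A) float array of rewards.
    H: horizon.
    Returns (Q_rand, Q_star), each of shape (H, S, A).
    """
    S, A = R.shape
    Q_rand = np.zeros((H, S, A))
    Q_star = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        for s in range(S):
            for a in range(A):
                s_next = T[s, a]
                if h == H - 1:
                    v_rand = v_star = 0.0
                else:
                    v_rand = Q_rand[h + 1, s_next].mean()  # continue by acting randomly
                    v_star = Q_star[h + 1, s_next].max()   # continue by acting optimally
                Q_rand[h, s, a] = R[s, a] + v_rand
                Q_star[h, s, a] = R[s, a] + v_star
    return Q_rand, Q_star

def greedy_on_random_q_is_optimal(T, R, H):
    """Check the property highlighted above: at every step and state, is every
    action that maximizes the random policy's Q-value also optimal under Q*?"""
    Q_rand, Q_star = q_random_and_optimal(T, R, H)
    for h in range(H):
        for s in range(R.shape[0]):
            greedy = np.flatnonzero(np.isclose(Q_rand[h, s], Q_rand[h, s].max()))
            if not np.all(np.isclose(Q_star[h, s, greedy], Q_star[h, s].max())):
                return False
    return True
```

When this check passes, roughly a single step of lookahead with random rollouts suffices to identify optimal actions, which is the regime where the abstract reports that deep RL tends to succeed; when it fails, more lookahead is needed and the effective horizon grows.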
Related papers
- SAPG: Split and Aggregate Policy Gradients [37.433915947580076]
We propose a new on-policy RL algorithm that can effectively leverage large-scale environments by splitting them into chunks and fusing them back together via importance sampling.
Our algorithm, termed SAPG, achieves significantly higher performance on a variety of challenging environments where vanilla PPO and other strong baselines fall short.
arXiv Detail & Related papers (2024-07-29T17:59:50Z) - Q-Star Meets Scalable Posterior Sampling: Bridging Theory and Practice via HyperAgent [23.669599662214686]
HyperAgent is a reinforcement learning (RL) algorithm based on the hypermodel framework for exploration in RL.
We demonstrate that HyperAgent offers robust performance in large-scale deep RL benchmarks.
It can solve Deep Sea hard exploration problems with a number of episodes that scales optimally with problem size, and exhibits significant efficiency gains in the Atari suite.
arXiv Detail & Related papers (2024-02-05T07:07:30Z) - The Effective Horizon Explains Deep RL Performance in Stochastic Environments [21.148001945560075]
Reinforcement learning (RL) theory has largely focused on proving minimax sample complexity bounds.
We introduce a new RL algorithm, SQIRL, that iteratively learns a near-optimal policy by exploring randomly to collect rollouts.
We leverage SQIRL to derive instance-dependent sample complexity bounds for RL that are exponential only in an "effective horizon" lookahead and in the complexity of the class used for function approximation.
arXiv Detail & Related papers (2023-12-13T18:58:56Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
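The redistribution step can be written compactly: summing per-step features over a trajectory turns "proxy rewards must sum to the observed return" into an ordinary ridge regression. The snippet below is a minimal sketch of that step only, assuming a linear proxy-reward model and illustrative names; it omits the pessimistic value iteration and uncertainty penalties that PARTED adds on top.

```python
import numpy as np

def least_squares_reward_redistribution(trajectories, returns, lam=1e-3):
    """Fit proxy rewards r_hat(s, a) = phi(s, a) @ theta so that, for each
    trajectory, the proxy rewards approximately sum to its observed return.

    trajectories: list of (T_i, d) arrays of per-step features phi(s_t, a_t).
    returns: length-N array of trajectory returns.
    lam: ridge regularization strength.
    """
    # Summing features over each trajectory reduces the problem to ridge regression:
    #   minimize sum_i (Phi_i @ theta - G_i)^2 + lam * ||theta||^2
    Phi = np.stack([traj.sum(axis=0) for traj in trajectories])  # (N, d)
    G = np.asarray(returns)                                      # (N,)
    d = Phi.shape[1]
    theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ G)
    return theta  # per-step proxy reward for features x is x @ theta
```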
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z) - Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for preference-based RL (PbRL) with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that the separation between reward-free and reward-aware RL known from the tabular setting does not exist in linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
arXiv Detail & Related papers (2020-12-30T09:06:57Z) - Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to the Bellman optimality and evaluation back-ups, making the update more conservative, can yield much stronger guarantees; a minimal sketch of such a conservative backup follows this entry.
arXiv Detail & Related papers (2020-07-16T09:25:54Z)
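As a rough illustration of the conservative backup idea referenced above (not the paper's algorithm; the support threshold, pessimistic default, and names are assumptions made for the sketch), a tabular Q-iteration can simply refuse to back up through state-action pairs that the batch barely covers:

```python
import numpy as np

def conservative_q_iteration(counts, R_hat, P_hat, gamma=0.99, n_min=10,
                             v_min=0.0, iters=200):
    """Tabular Q-iteration with a conservative backup: the max is taken only
    over actions with at least n_min samples in the batch; poorly supported
    actions are pinned to a pessimistic value v_min.

    counts: (S, A) visit counts in the batch dataset.
    R_hat:  (S, A) empirical mean rewards.
    P_hat:  (S, A, S) empirical transition probabilities.
    """
    S, A = counts.shape
    supported = counts >= n_min
    Q = np.full((S, A), v_min)
    for _ in range(iters):
        # Conservative state values: ignore actions we cannot trust; if no
        # action at a state is supported, fall back to the pessimistic value.
        masked = np.where(supported, Q, -np.inf)
        V = np.where(supported.any(axis=1), masked.max(axis=1), v_min)
        Q = np.where(supported, R_hat + gamma * P_hat @ V, v_min)
    return Q
```

The PEVI entry above relies on a related form of pessimism, using an explicit uncertainty penalty rather than a hard support threshold.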