Where is the Grass Greener? Revisiting Generalized Policy Iteration for
Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2107.01407v1
- Date: Sat, 3 Jul 2021 11:00:56 GMT
- Title: Where is the Grass Greener? Revisiting Generalized Policy Iteration for
Offline Reinforcement Learning
- Authors: Lionel Blondé, Alexandros Kalousis
- Abstract summary: We re-implement state-of-the-art baselines in the offline RL regime under a fair, unified, and highly factorized framework.
We show that when a given baseline outperforms its competing counterparts on one end of the spectrum, it never does on the other end.
- Score: 81.15016852963676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of state-of-the-art baselines in the offline RL regime varies
widely over the spectrum of dataset qualities, ranging from "far-from-optimal"
random data to "close-to-optimal" expert demonstrations. We re-implement these
under a fair, unified, and highly factorized framework, and show that when a
given baseline outperforms its competing counterparts on one end of the
spectrum, it never does on the other end. This consistent trend prevents us
from naming a victor that outperforms the rest across the board. We attribute
the asymmetry in performance between the two ends of the quality spectrum to
the amount of inductive bias injected into the agent to entice it to posit that
the behavior underlying the offline dataset is optimal for the task. The more
bias is injected, the better the agent performs, provided the dataset is
close-to-optimal. Otherwise, its effect is brutally detrimental. Adopting an
advantage-weighted regression template as a base, we conduct an investigation
which corroborates that injections of such optimality inductive bias, when not
done parsimoniously, make the agent subpar on the datasets where it was dominant as
soon as the offline policy is sub-optimal. In an effort to design methods that
perform well across the whole spectrum, we revisit the generalized policy
iteration scheme for the offline regime, and study the impact of nine distinct
newly-introduced proposal distributions over actions, involved in the proposed
generalization of the policy evaluation and policy improvement update rules. We
show that certain orchestrations strike the right balance and can improve the
performance on one end of the spectrum without harming it on the other end.
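The advantage-weighted regression template mentioned in the abstract weights a behavior-cloning objective by the exponentiated advantage of each dataset action, and the temperature of that exponentiation is one concrete dial for the optimality inductive bias discussed above. The following is a minimal, hypothetical PyTorch-style sketch of such a policy-improvement step; the class, function, and argument names are illustrative assumptions, not taken from the paper's code.

```python
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy over continuous actions (illustrative)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def log_prob(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        h = self.trunk(obs)
        dist = torch.distributions.Normal(self.mu(h), self.log_std.exp())
        return dist.log_prob(act).sum(dim=-1)


def awr_policy_loss(policy: GaussianPolicy,
                    obs: torch.Tensor,
                    act: torch.Tensor,
                    advantage: torch.Tensor,
                    temperature: float = 1.0,
                    weight_clip: float = 20.0) -> torch.Tensor:
    """Advantage-weighted behavior cloning on an offline batch.

    A large temperature flattens the weights toward plain behavior cloning
    (a strong bet that the logged behavior is optimal); a small temperature
    concentrates the update on high-advantage actions and trusts the critic
    instead.
    """
    weights = torch.exp(advantage / temperature).clamp(max=weight_clip)
    return -(weights.detach() * policy.log_prob(obs, act)).mean()
```

Only the policy-improvement step is shown here; a separately trained critic would supply the `advantage` tensor for each state-action pair in the offline batch.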
Related papers
- Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning [12.112619241073158]
In offline reinforcement learning, the challenge of out-of-distribution actions is pronounced.
Existing methods often constrain the learned policy through policy regularization.
We propose Adaptive Advantage-guided Policy Regularization (A2PR).
arXiv Detail & Related papers (2024-05-30T10:20:55Z)
- Preferred-Action-Optimized Diffusion Policies for Offline Reinforcement Learning [19.533619091287676]
We propose a novel preferred-action-optimized diffusion policy for offline reinforcement learning.
In particular, an expressive conditional diffusion model is utilized to represent the diverse distribution of a behavior policy.
Experiments demonstrate that the proposed method provides competitive or superior performance compared to previous state-of-the-art offline RL methods.
arXiv Detail & Related papers (2024-05-29T03:19:59Z)
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
- Offline Imitation Learning with Suboptimal Demonstrations via Relaxed Distribution Matching [109.5084863685397]
Offline imitation learning (IL) promises the ability to learn performant policies from pre-collected demonstrations without interacting with the environment.
We present RelaxDICE, which employs an asymmetrically-relaxed f-divergence for explicit support regularization.
Our method significantly outperforms the best prior offline method in six standard continuous control environments.
arXiv Detail & Related papers (2023-03-05T03:35:11Z)
- Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
- Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
arXiv Detail & Related papers (2020-12-30T09:06:57Z)
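The last entry above (PEVI) penalizes the Bellman backup with an uncertainty quantifier so that poorly covered state-action pairs are valued pessimistically. The toy tabular sketch below illustrates that idea with an assumed count-based penalty; this is a simplification rather than the paper's exact bonus construction.

```python
import numpy as np


def pessimistic_value_iteration(P, R, counts, gamma=0.99, beta=1.0, iters=200):
    """Tabular value iteration with an uncertainty penalty on each (s, a).

    P:      (S, A, S) transition model estimated from the offline dataset.
    R:      (S, A) estimated rewards.
    counts: (S, A) visitation counts in the offline dataset.
    The penalty beta / sqrt(max(counts, 1)) shrinks the value of poorly
    covered state-action pairs, keeping the greedy policy on the data support.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    penalty = beta / np.sqrt(np.maximum(counts, 1.0))
    for _ in range(iters):
        Q = R - penalty + gamma * (P @ V)  # penalized Bellman backup
        V = Q.max(axis=1)
    greedy_policy = Q.argmax(axis=1)
    return greedy_policy, Q
```

With beta = 0 this reduces to ordinary value iteration on the estimated model; increasing beta pushes the greedy policy toward actions that are well supported by the offline dataset.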
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.