Importance-Weighted Offline Learning Done Right
- URL: http://arxiv.org/abs/2309.15771v1
- Date: Wed, 27 Sep 2023 16:42:10 GMT
- Title: Importance-Weighted Offline Learning Done Right
- Authors: Germano Gabbianelli, Gergely Neu, Matteo Papini
- Abstract summary: We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all possible terms to all previous results.
- Score: 16.4989952150404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of offline policy optimization in stochastic contextual
bandit problems, where the goal is to learn a near-optimal policy based on a
dataset of decision data collected by a suboptimal behavior policy. Rather than
making any structural assumptions on the reward function, we assume access to a
given policy class and aim to compete with the best comparator policy within
this class. In this setting, a standard approach is to compute
importance-weighted estimators of the value of each policy, and select a policy
that maximizes the estimated value up to a "pessimistic" adjustment subtracted
from the estimates to reduce their random fluctuations. In this paper, we show
that a simple alternative approach based on the "implicit exploration"
estimator of \citet{Neu2015} yields performance guarantees that are superior in
nearly all possible terms to all previous results. Most notably, we remove an
extremely restrictive "uniform coverage" assumption made in all previous works.
These improvements are made possible by the observation that the upper and
lower tails of importance-weighted estimators behave very differently from each
other, and their careful control can massively improve on previous results that
were all based on symmetric two-sided concentration inequalities. We also
extend our results to infinite policy classes in a PAC-Bayesian fashion, and
showcase the robustness of our algorithm to the choice of hyper-parameters by
means of numerical simulations.
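To make the contrast drawn in the abstract concrete, below is a minimal sketch (not the authors' code) of the standard importance-weighted value estimator and the implicit-exploration (IX) variant in the spirit of Neu (2015). The array names, the candidate-policy representation, and the default value of gamma are illustrative assumptions only.

```python
import numpy as np

def ips_value(pi_prob, pi0_prob, rewards):
    """Standard importance-weighted (inverse propensity scoring) value estimate.
    pi_prob[i]  = pi(a_i | x_i) for the evaluated policy,
    pi0_prob[i] = pi0(a_i | x_i) for the behavior policy."""
    weights = pi_prob / pi0_prob
    return float(np.mean(weights * rewards))

def ix_value(pi_prob, pi0_prob, rewards, gamma=0.1):
    """Implicit-exploration (IX) value estimate: shifting the propensity in the
    denominator by gamma > 0 caps the importance weights and biases the estimate
    downward, acting as an implicit pessimistic adjustment."""
    weights = pi_prob / (pi0_prob + gamma)
    return float(np.mean(weights * rewards))

def select_policy_ix(candidate_probs, pi0_prob, rewards, gamma=0.1):
    """Pick the candidate policy with the largest IX estimate.
    candidate_probs: list of arrays, each holding pi_k(a_i | x_i) on the logged data."""
    scores = [ix_value(p, pi0_prob, rewards, gamma) for p in candidate_probs]
    return int(np.argmax(scores)), scores
```

In this sketch, gamma plays the role that the explicit "pessimistic" penalty plays in the baseline approach described above: larger values cap the weights more aggressively, trading variance for downward bias. The abstract's simulations are reported as showing robustness to this hyper-parameter choice; the specific default used here is an assumption.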
Related papers
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z) - Bi-Level Offline Policy Optimization with Limited Exploration [1.8130068086063336]
We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset.
We propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level).
We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2023-10-10T02:45:50Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
We develop a new framework for off-policy evaluation with a policy-dependent linear optimization response.
We construct unbiased estimators for the policy-dependent estimand by a perturbation method.
We provide a general algorithm for optimizing causal interventions.
arXiv Detail & Related papers (2022-02-25T20:25:37Z) - Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium
Learning from Offline Datasets [101.5329678997916]
We study episodic two-player zero-sum Markov games (MGs) in the offline setting.
The goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori.
arXiv Detail & Related papers (2022-02-15T15:39:30Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Robust Batch Policy Learning in Markov Decision Processes [0.0]
We study the offline data-driven sequential decision making problem in the framework of Markov decision processes (MDPs).
We propose to evaluate each policy by a set of the average rewards with respect to distributions centered at the policy induced stationary distribution.
arXiv Detail & Related papers (2020-11-09T04:41:21Z) - Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z) - The Importance of Pessimism in Fixed-Dataset Policy Optimization [32.22700716592194]
We study worst-case guarantees on the expected return of fixed-dataset policy optimization algorithms.
For naive approaches, the possibility of erroneous value overestimation leads to a difficult-to-satisfy requirement.
We show why pessimistic algorithms can achieve good performance even when the dataset is not informative of every policy.
arXiv Detail & Related papers (2020-09-15T00:18:34Z)