Importance-Weighted Offline Learning Done Right
        - URL: http://arxiv.org/abs/2309.15771v1
- Date: Wed, 27 Sep 2023 16:42:10 GMT
- Title: Importance-Weighted Offline Learning Done Right
- Authors: Germano Gabbianelli, Gergely Neu, Matteo Papini
- Abstract summary: We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of citet2015 yields performance guarantees that are superior in nearly all possible terms to all previous results.
- Score: 16.4989952150404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   We study the problem of offline policy optimization in stochastic contextual
bandit problems, where the goal is to learn a near-optimal policy based on a
dataset of decision data collected by a suboptimal behavior policy. Rather than
making any structural assumptions on the reward function, we assume access to a
given policy class and aim to compete with the best comparator policy within
this class. In this setting, a standard approach is to compute
importance-weighted estimators of the value of each policy, and select a policy
that minimizes the estimated value up to a "pessimistic" adjustment subtracted
from the estimates to reduce their random fluctuations. In this paper, we show
that a simple alternative approach based on the "implicit exploration"
estimator of \citet{Neu2015} yields performance guarantees that are superior in
nearly all possible terms to all previous results. Most notably, we remove an
extremely restrictive "uniform coverage" assumption made in all previous works.
These improvements are made possible by the observation that the upper and
lower tails importance-weighted estimators behave very differently from each
other, and their careful control can massively improve on previous results that
were all based on symmetric two-sided concentration inequalities. We also
extend our results to infinite policy classes in a PAC-Bayesian fashion, and
showcase the robustness of our algorithm to the choice of hyper-parameters by
means of numerical simulations.
 
      
        Related papers
        - Convergence and Sample Complexity of First-Order Methods for Agnostic   Reinforcement Learning [66.4260157478436]
 We study reinforcement learning in the policy learning setting.<n>The goal is to find a policy whose performance is competitive with the best policy in a given class of interest.
 arXiv  Detail & Related papers  (2025-07-06T14:40:05Z)
- Quantile-Optimal Policy Learning under Unmeasured Confounding [55.72891849926314]
 We study quantile-optimal policy learning where the goal is to find a policy whose reward distribution has the largest $alpha$-quantile for some $alpha in (0, 1)$.<n>Such a problem suffers from three main challenges: (i) nonlinearity of the quantile objective as a functional of the reward distribution, (ii) unobserved confounding issue, and (iii) insufficient coverage of the offline dataset.
 arXiv  Detail & Related papers  (2025-06-08T13:37:38Z)
- Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward   POMDPs with Known Observation Models [56.92178753201331]
 We tackle average-reward infinite-horizon POMDPs with an unknown transition model.
We present a novel and simple estimator that overcomes this barrier.
 arXiv  Detail & Related papers  (2025-01-30T22:29:41Z)
- Optimal Policy Adaptation under Covariate Shift [15.703626346971182]
 We propose principled approaches for learning the optimal policy in the target domain by leveraging two datasets.
We derive the identifiability assumptions for the reward induced by a given policy.
We then learn the optimal policy by optimizing the estimated reward.
 arXiv  Detail & Related papers  (2025-01-14T12:33:02Z)
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
 We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
 arXiv  Detail & Related papers  (2024-05-09T12:52:22Z)
- Bi-Level Offline Policy Optimization with Limited Exploration [1.8130068086063336]
 We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset.
We propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level)
We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
 arXiv  Detail & Related papers  (2023-10-10T02:45:50Z)
- Policy learning "without" overlap: Pessimism and generalized empirical   Bernstein's inequality [94.89246810243053]
 This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimize lower confidence bounds (LCBs) instead of point estimates.
 arXiv  Detail & Related papers  (2022-12-19T22:43:08Z)
- Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
 We develop a new framework for off-policy evaluation with a textitpolicy-dependent linear optimization response.
We construct unbiased estimators for the policy-dependent estimand by a perturbation method.
We provide a general algorithm for optimizing causal interventions.
 arXiv  Detail & Related papers  (2022-02-25T20:25:37Z)
- Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium
  Learning from Offline Datasets [101.5329678997916]
 We study episodic two-player zero-sum Markov games (MGs) in the offline setting.
The goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori.
 arXiv  Detail & Related papers  (2022-02-15T15:39:30Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
  Globally Optimal Policy [95.98698822755227]
 We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
 arXiv  Detail & Related papers  (2020-12-28T05:02:26Z)
- Robust Batch Policy Learning in Markov Decision Processes [0.0]
 We study the offline data-driven sequential decision making problem in the framework of Markov decision process (MDP)
We propose to evaluate each policy by a set of the average rewards with respect to distributions centered at the policy induced stationary distribution.
 arXiv  Detail & Related papers  (2020-11-09T04:41:21Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
 In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged data.
 arXiv  Detail & Related papers  (2020-11-08T23:16:19Z)
- The Importance of Pessimism in Fixed-Dataset Policy Optimization [32.22700716592194]
 We study worst-case guarantees on the expected return of fixed-dataset policy optimization algorithms.
For naive approaches, the possibility of erroneous value overestimation leads to a difficult-to-satisfy requirement.
We show why pessimistic algorithms can achieve good performance even when the dataset is not informative of every policy.
 arXiv  Detail & Related papers  (2020-09-15T00:18:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     
           This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.