Related papers: Unified PAC-Bayesian Study of Pessimism for Offline Policy Learning with Regularized Importance Sampling

URL: http://arxiv.org/abs/2406.03434v1
Date: Wed, 5 Jun 2024 16:32:14 GMT
Title: Unified PAC-Bayesian Study of Pessimism for Offline Policy Learning with Regularized Importance Sampling
Authors: Imad Aouali, Victor-Emmanuel Brunel, David Rohde, Anna Korba
Abstract summary: We introduce a tractable PAC-Bayesian generalization bound that universally applies to common importance weight regularizations. Our results challenge common understanding, demonstrating the effectiveness of standard IW regularization techniques.
Score: 13.001601860404426
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Off-policy learning (OPL) often involves minimizing a risk estimator based on importance weighting to correct bias from the logging policy used to collect data. However, this method can produce an estimator with a high variance. A common solution is to regularize the importance weights and learn the policy by minimizing an estimator with penalties derived from generalization bounds specific to the estimator. This approach, known as pessimism, has gained recent attention but lacks a unified framework for analysis. To address this gap, we introduce a comprehensive PAC-Bayesian framework to examine pessimism with regularized importance weighting. We derive a tractable PAC-Bayesian generalization bound that universally applies to common importance weight regularizations, enabling their comparison within a single framework. Our empirical results challenge common understanding, demonstrating the effectiveness of standard IW regularization techniques.
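To make the abstract's starting point concrete, the following is a minimal sketch of an importance-weighted risk estimate with one common regularization, weight clipping. It is not the paper's estimator or bound, which this listing does not reproduce; the function name `clipped_ips_risk`, its arguments, and the default threshold `clip=10.0` are all hypothetical choices made for illustration.

```python
def clipped_ips_risk(costs, target_probs, logging_probs, clip=10.0):
    """Estimate the risk of a target policy from logged bandit data.

    costs[i]         : observed cost of the logged action (lower is better)
    target_probs[i]  : probability the target policy gives that action
    logging_probs[i] : probability the logging policy gave it (must be > 0)
    clip             : hypothetical cap on each importance weight
    """
    n = len(costs)
    if n == 0:
        raise ValueError("need at least one logged sample")
    if not (n == len(target_probs) == len(logging_probs)):
        raise ValueError("inputs must have equal length")

    total = 0.0
    for cost, p_target, p_logging in zip(costs, target_probs, logging_probs):
        # Plain importance weight: corrects for the logging policy's bias.
        weight = p_target / p_logging
        # Clipping trades a little bias for a large variance reduction.
        total += min(weight, clip) * cost
    return total / n


# Toy data (made up): the third sample has a tiny logging probability,
# so its unclipped weight of about 90 dominates the plain estimate.
costs = [1.0, 0.0, 1.0]
target_probs = [0.5, 0.5, 0.9]
logging_probs = [0.5, 0.25, 0.01]

clipped = clipped_ips_risk(costs, target_probs, logging_probs, clip=10.0)
unclipped = clipped_ips_risk(costs, target_probs, logging_probs, clip=float("inf"))
```

On this toy data the clipped estimate is far smaller than the unclipped one, which illustrates the variance problem the abstract describes: a single sample with a small logging probability can dominate the plain importance-weighted average.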

Related papers

Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL [6.224756774400233]
We study offline reinforcement learning in average-reward MDPs, which presents increased challenges from the perspectives of distribution shift and non-uniform coverage. We develop sharp guarantees depending only on the target policy, specifically the bias span and a novel policy hitting radius, yielding the first fully single-policy sample complexity bound for average-reward offline RL.
arXiv Detail & Related papers (2025-06-26T00:22:39Z)
Rethinking Robustness in Machine Learning: A Posterior Agreement Approach [45.284633306624634]
Posterior Agreement (PA) theory of model validation provides a principled framework for robustness evaluation. We show that the PA metric provides a sensible and consistent analysis of the vulnerabilities in learning algorithms, even in the presence of few observations.
arXiv Detail & Related papers (2025-03-20T16:03:39Z)
Statistical Analysis of Policy Space Compression Problem [54.1754937830779]
Policy search methods are crucial in reinforcement learning, offering a framework to address continuous state-action and partially observable problems. Reducing the policy space through policy compression emerges as a powerful, reward-free approach to accelerate the learning process. This technique condenses the policy space into a smaller, representative set while maintaining most of the original effectiveness.
arXiv Detail & Related papers (2024-11-15T02:46:55Z)
Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies. Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
Domain Generalization without Excess Empirical Risk [83.26052467843725]
A common approach is designing a data-driven surrogate penalty to capture generalization and minimize the empirical risk jointly with the penalty. We argue that a significant failure mode of this recipe is an excess risk due to an erroneous penalty or hardness in joint optimization. We present an approach that eliminates this problem. Instead of jointly minimizing empirical risk with the penalty, we minimize the penalty under the constraint of optimality of the empirical risk.
arXiv Detail & Related papers (2023-08-30T08:46:46Z)
Exponential Smoothing for Off-Policy Learning [16.284314586358928]
We derive a two-sided PAC-Bayes generalization bound for inverse propensity scoring (IPS). The bound is tractable, scalable, and interpretable, and it provides learning certificates.
arXiv Detail & Related papers (2023-05-25T09:18:45Z)
A Unified Framework of Policy Learning for Contextual Bandit with Confounding Bias and Missing Observations [108.89353070722497]
We study the offline contextual bandit problem, where we aim to acquire an optimal policy using observational data. We present a new algorithm called Causal-Adjusted Pessimistic (CAP) policy learning, which forms the reward function as the solution of an integral equation system.
arXiv Detail & Related papers (2023-03-20T15:17:31Z)
Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning. Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded. We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
Learning Stochastic Majority Votes by Minimizing a PAC-Bayes Generalization Bound [15.557653926558638]
We investigate a counterpart of majority votes over finite ensembles of classifiers, and study its generalization properties. We instantiate it with Dirichlet distributions: this allows for a closed-form and differentiable expression for the expected risk. The resulting majority vote learning algorithm achieves state-of-the-art accuracy and benefits from (non-vacuous) tight bounds.
arXiv Detail & Related papers (2021-06-23T16:57:23Z)
A PAC-Bayes Analysis of Adversarial Robustness [0.0]
We propose the first general PAC-Bayesian generalization bounds for adversarial robustness. We leverage the PAC-Bayesian framework to bound the averaged risk on the perturbations for majority votes.
arXiv Detail & Related papers (2021-02-19T10:23:48Z)
PAC-Bayes unleashed: generalisation bounds with unbounded losses [12.078257783674923]
We present new PAC-Bayesian generalisation bounds for learning problems with unbounded loss functions. This extends the relevance and applicability of the PAC-Bayes learning framework.
arXiv Detail & Related papers (2020-06-12T15:55:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.