Exponential Smoothing for Off-Policy Learning
- URL: http://arxiv.org/abs/2305.15877v2
- Date: Mon, 5 Jun 2023 13:05:59 GMT
- Title: Exponential Smoothing for Off-Policy Learning
- Authors: Imad Aouali, Victor-Emmanuel Brunel, David Rohde, Anna Korba
- Abstract summary: We derive a two-sided PAC-Bayes generalization bound for inverse propensity scoring (IPS).
The bound is tractable, scalable, interpretable and provides learning certificates.
- Score: 16.284314586358928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy learning (OPL) aims at finding improved policies from logged
bandit data, often by minimizing the inverse propensity scoring (IPS) estimator
of the risk. In this work, we investigate a smooth regularization for IPS, for
which we derive a two-sided PAC-Bayes generalization bound. The bound is
tractable, scalable, interpretable and provides learning certificates. In
particular, it is also valid for standard IPS without making the assumption
that the importance weights are bounded. We demonstrate the relevance of our
approach and its favorable performance through a set of learning tasks. Since
our bound holds for standard IPS, we are able to provide insight into when
regularizing IPS is useful. Namely, we identify cases where regularization
might not be needed. This goes against the common belief that, in practice, clipped
IPS often performs better than standard IPS in OPL.
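For concreteness, here is a minimal sketch (not code from the paper) of the standard IPS risk estimate alongside two regularized variants: a smoothed estimator in which the importance weights are raised to a power alpha in [0, 1], which is one natural reading of the exponential smoothing studied here, and the clipped estimator discussed above. The function name ips_risk, its defaults, and the toy data are illustrative assumptions.

```python
import numpy as np

def ips_risk(pi_target, pi_logging, costs, alpha=1.0, clip=None):
    """Estimate the risk of a target policy from logged bandit data.

    pi_target, pi_logging: probabilities assigned to the logged actions by
    the target and logging policies (shape [n]).
    costs: observed costs for the logged actions (shape [n]).
    alpha: smoothing exponent; alpha=1 recovers standard IPS, alpha<1
           shrinks large importance weights (illustrative reading of the
           paper's exponential smoothing regularizer).
    clip: if set, truncate weights at this value (clipped IPS baseline).
    """
    w = pi_target / pi_logging          # importance weights
    w = w ** alpha                      # exponential smoothing of the weights
    if clip is not None:
        w = np.minimum(w, clip)         # hard clipping, for comparison
    return float(np.mean(w * costs))    # empirical IPS-style risk

# Toy usage with synthetic logged data
rng = np.random.default_rng(0)
n = 1000
pi0 = rng.uniform(0.05, 1.0, size=n)   # logging propensities
pi = rng.uniform(0.05, 1.0, size=n)    # target-policy probabilities
c = rng.binomial(1, 0.3, size=n)       # observed binary costs
print(ips_risk(pi, pi0, c))            # standard IPS
print(ips_risk(pi, pi0, c, alpha=0.7)) # exponentially smoothed IPS
print(ips_risk(pi, pi0, c, clip=10.0)) # clipped IPS
```

With alpha = 1 and no clipping the estimator reduces to standard IPS; smaller values of alpha shrink large importance weights, trading additional bias for lower variance.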
Related papers
- Unified PAC-Bayesian Study of Pessimism for Offline Policy Learning with Regularized Importance Sampling [13.001601860404426]
We introduce a tractable PAC-Bayesian generalization bound that universally applies to common importance weight regularizations.
Our results challenge common understanding, demonstrating the effectiveness of standard importance-weight (IW) regularization techniques.
arXiv Detail & Related papers (2024-06-05T16:32:14Z) - Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning [12.112619241073158]
In offline reinforcement learning, the challenge of out-of-distribution actions is pronounced.
Existing methods often constrain the learned policy through policy regularization.
We propose Adaptive Advantage-guided Policy Regularization (A2PR).
arXiv Detail & Related papers (2024-05-30T10:20:55Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is typically employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction [22.215852332444907]
We study the problem of slate contextual bandits where a policy selects multi-dimensional actions known as slates.
The typical Inverse Propensity Scoring (IPS) estimator suffers from substantial variance due to large action spaces.
We develop a novel estimator for OPE of slate bandits, called Latent IPS (LIPS), which defines importance weights in a low-dimensional slate abstraction space.
arXiv Detail & Related papers (2024-02-03T14:38:09Z) - Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly suboptimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z) - Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual Bandits [82.28442917447643]
We present the first general oracle-efficient algorithm for pessimistic OPO.
We obtain statistical guarantees analogous to those for prior pessimistic approaches.
We show an advantage over unregularized OPO across a wide range of configurations.
arXiv Detail & Related papers (2023-06-13T17:29:50Z) - Safe Deployment for Counterfactual Learning to Rank with Exposure-Based Risk Minimization [63.93275508300137]
We introduce a novel risk-aware Counterfactual Learning To Rank method with theoretical guarantees for safe deployment.
Our experimental results demonstrate the efficacy of our proposed method, which is effective at avoiding initial periods of bad performance when little data is available.
arXiv Detail & Related papers (2023-04-26T15:54:23Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - State-Aware Proximal Pessimistic Algorithms for Offline Reinforcement Learning [36.34691755377286]
Pessimism is of great importance in offline reinforcement learning (RL).
We propose a principled algorithmic framework for offline RL, called State-Aware Proximal Pessimism (SA-PP).
arXiv Detail & Related papers (2022-11-28T04:56:40Z) - You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio-clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples (a sketch of this clipped surrogate appears after this list).
We show in this paper that such ratio clipping may not be a good option, as it can fail to effectively bound the ratios.
We also show that the proposed alternative, ESPO, can be easily scaled up to distributed training with many workers, delivering strong performance.
arXiv Detail & Related papers (2022-01-31T20:26:56Z) - BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve behavior-regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
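For reference, the sketch below spells out the standard ratio-clipped PPO surrogate that the "You May Not Need Ratio Clipping in PPO" entry refers to; it is the generic PPO-clip objective, not the ESPO alternative proposed in that paper, and the function name and default eps are illustrative.

```python
import numpy as np

def ppo_clip_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized).

    logp_new, logp_old: log-probabilities of the taken actions under the
    current policy and the data-collecting policy (shape [n]).
    advantages: advantage estimates for those actions (shape [n]).
    eps: clipping range for the probability ratio (illustrative default).
    """
    ratio = np.exp(logp_new - logp_old)             # probability ratio r_t
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)  # clipped ratio
    # Pessimistic (element-wise minimum) of the clipped and unclipped surrogates
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))
```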