Exponential Smoothing for Off-Policy Learning
- URL: http://arxiv.org/abs/2305.15877v2
- Date: Mon, 5 Jun 2023 13:05:59 GMT
- Title: Exponential Smoothing for Off-Policy Learning
- Authors: Imad Aouali, Victor-Emmanuel Brunel, David Rohde, Anna Korba
- Abstract summary: We derive a two-sided PAC-Bayes generalization bound for inverse propensity scoring (IPS).
The bound is tractable, scalable, interpretable and provides learning certificates.
- Score: 16.284314586358928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy learning (OPL) aims at finding improved policies from logged
bandit data, often by minimizing the inverse propensity scoring (IPS) estimator
of the risk. In this work, we investigate a smooth regularization for IPS, for
which we derive a two-sided PAC-Bayes generalization bound. The bound is
tractable, scalable, interpretable and provides learning certificates. In
particular, it is also valid for standard IPS without making the assumption
that the importance weights are bounded. We demonstrate the relevance of our
approach and its favorable performance through a set of learning tasks. Since
our bound holds for standard IPS, we are able to provide insight into when
regularizing IPS is useful. Namely, we identify cases where regularization
might not be needed. This goes against the common belief that, in practice, clipped
IPS often performs better than standard IPS in OPL.
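For concreteness, here is a minimal sketch (not code from the paper) of the standard IPS risk estimate alongside two regularized variants: a smoothed estimator in which the importance weights are raised to a power alpha in [0, 1], which is one natural reading of the exponential smoothing studied here, and the clipped estimator discussed above. The function name ips_risk, its defaults, and the toy data are illustrative assumptions.

```python
import numpy as np

def ips_risk(pi_target, pi_logging, costs, alpha=1.0, clip=None):
    """Estimate the risk of a target policy from logged bandit data.

    pi_target, pi_logging: probabilities assigned to the logged actions by
    the target and logging policies (shape [n]).
    costs: observed costs for the logged actions (shape [n]).
    alpha: smoothing exponent; alpha=1 recovers standard IPS, alpha<1
           shrinks large importance weights (illustrative reading of the
           paper's exponential smoothing regularizer).
    clip: if set, truncate weights at this value (clipped IPS baseline).
    """
    w = pi_target / pi_logging          # importance weights
    w = w ** alpha                      # exponential smoothing of the weights
    if clip is not None:
        w = np.minimum(w, clip)         # hard clipping, for comparison
    return float(np.mean(w * costs))    # empirical IPS-style risk

# Toy usage with synthetic logged data
rng = np.random.default_rng(0)
n = 1000
pi0 = rng.uniform(0.05, 1.0, size=n)   # logging propensities
pi = rng.uniform(0.05, 1.0, size=n)    # target-policy probabilities
c = rng.binomial(1, 0.3, size=n)       # observed binary costs
print(ips_risk(pi, pi0, c))            # standard IPS
print(ips_risk(pi, pi0, c, alpha=0.7)) # exponentially smoothed IPS
print(ips_risk(pi, pi0, c, clip=10.0)) # clipped IPS
```

With alpha = 1 and no clipping the estimator reduces to standard IPS; smaller values of alpha shrink large importance weights, trading additional bias for lower variance.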
Related papers
- Unified PAC-Bayesian Study of Pessimism for Offline Policy Learning with Regularized Importance Sampling [13.001601860404426]
We introduce a tractable PAC-Bayesian generalization bound that universally applies to common importance weight regularizations.
Our results challenge common understanding, demonstrating the effectiveness of standard importance-weight (IW) regularization techniques.
arXiv Detail & Related papers (2024-06-05T16:32:14Z) - Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning [12.112619241073158]
In offline reinforcement learning, the challenge of out-of-distribution actions is pronounced.
Existing methods often constrain the learned policy through policy regularization.
We propose Adaptive Advantage-guided Policy Regularization (A2PR).
arXiv Detail & Related papers (2024-05-30T10:20:55Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is typically employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction [22.215852332444907]
We study the problem of slate contextual bandits where a policy selects multi-dimensional actions known as slates.
The typical Inverse Propensity Scoring (IPS) estimator suffers from substantial variance due to large action spaces.
We develop a novel estimator for OPE of slate bandits, called Latent IPS (LIPS), which defines importance weights in a low-dimensional slate abstraction space.
arXiv Detail & Related papers (2024-02-03T14:38:09Z) - Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly suboptimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z) - Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual Bandits [82.28442917447643]
We present the first general oracle-efficient algorithm for pessimistic OPO.
We obtain statistical guarantees analogous to those for prior pessimistic approaches.
We show an advantage over unregularized OPO across a wide range of configurations.
arXiv Detail & Related papers (2023-06-13T17:29:50Z) - Safe Deployment for Counterfactual Learning to Rank with Exposure-Based Risk Minimization [63.93275508300137]
We introduce a novel risk-aware Counterfactual Learning To Rank method with theoretical guarantees for safe deployment.
Our experimental results demonstrate the efficacy of our proposed method, which is effective at avoiding initial periods of bad performance when little data is available.
arXiv Detail & Related papers (2023-04-26T15:54:23Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - State-Aware Proximal Pessimistic Algorithms for Offline Reinforcement Learning [36.34691755377286]
Pessimism is of great importance in offline reinforcement learning (RL).
We propose a principled algorithmic framework for offline RL, called State-Aware Proximal Pessimism (SA-PP).
arXiv Detail & Related papers (2022-11-28T04:56:40Z) - You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio-clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples (a sketch of this clipped surrogate appears after this list).
We show in this paper that such ratio clipping may not be a good option, as it can fail to effectively bound the ratios.
We also show that the proposed alternative, ESPO, can be easily scaled up to distributed training with many workers, delivering strong performance.
arXiv Detail & Related papers (2022-01-31T20:26:56Z) - BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve behavior-regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
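For reference, the sketch below spells out the standard ratio-clipped PPO surrogate that the "You May Not Need Ratio Clipping in PPO" entry refers to; it is the generic PPO-clip objective, not the ESPO alternative proposed in that paper, and the function name and default eps are illustrative.

```python
import numpy as np

def ppo_clip_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized).

    logp_new, logp_old: log-probabilities of the taken actions under the
    current policy and the data-collecting policy (shape [n]).
    advantages: advantage estimates for those actions (shape [n]).
    eps: clipping range for the probability ratio (illustrative default).
    """
    ratio = np.exp(logp_new - logp_old)             # probability ratio r_t
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)  # clipped ratio
    # Pessimistic (element-wise minimum) of the clipped and unclipped surrogates
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))
```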