Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning
- URL: http://arxiv.org/abs/2405.14335v2
- Date: Wed, 30 Oct 2024 21:53:43 GMT
- Title: Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning
- Authors: Otmane Sakhi, Imad Aouali, Pierre Alquier, Nicolas Chopin
- Abstract summary: This work investigates the offline formulation of the contextual bandit problem.
The goal is to leverage past interactions collected under a behavior policy to evaluate, select, and learn new, potentially better-performing, policies.
We introduce novel, fully empirical concentration bounds for a broad class of importance weighting risk estimators.
- Score: 7.085987593010675
- License:
- Abstract: This work investigates the offline formulation of the contextual bandit problem, where the goal is to leverage past interactions collected under a behavior policy to evaluate, select, and learn new, potentially better-performing, policies. Motivated by critical applications, we move beyond point estimators. Instead, we adopt the principle of pessimism where we construct upper bounds that assess a policy's worst-case performance, enabling us to confidently select and learn improved policies. Precisely, we introduce novel, fully empirical concentration bounds for a broad class of importance weighting risk estimators. These bounds are general enough to cover most existing estimators and pave the way for the development of new ones. In particular, our pursuit of the tightest bound within this class motivates a novel estimator (LS), that logarithmically smooths large importance weights. The bound for LS is provably tighter than its competitors, and naturally results in improved policy selection and learning strategies. Extensive policy evaluation, selection, and learning experiments highlight the versatility and favorable performance of LS.
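To make the abstract's central idea concrete, here is a minimal sketch, assuming logged rewards in [0, 1] and a hand-chosen smoothing parameter lam, of what logarithmically smoothing importance weights inside a plain IPS estimate could look like. The transform, its placement, and the toy data are illustrative assumptions; the paper's LS estimator and its concentration bounds are defined more precisely than this.

```python
import numpy as np

def ips_value(rewards, pi_target, pi_behavior):
    """Vanilla inverse-propensity-scoring (IPS) estimate of a policy's value."""
    w = pi_target / pi_behavior                 # raw importance weights
    return float(np.mean(w * rewards))

def ls_value(rewards, pi_target, pi_behavior, lam=0.1):
    """Log-smoothed variant: large weights are damped via w -> log(1 + lam*w)/lam.
    Illustrative only; the paper's exact LS estimator may differ in form and
    sign conventions."""
    w = pi_target / pi_behavior
    w_smooth = np.log1p(lam * w) / lam          # ~w for small w, grows only logarithmically
    return float(np.mean(w_smooth * rewards))

# toy logged bandit data (hypothetical): propensities of the logged actions
rng = np.random.default_rng(0)
n = 10_000
pi_behavior = rng.uniform(0.05, 1.0, size=n)    # behaviour propensities
pi_target = rng.uniform(0.0, 1.0, size=n)       # target propensities for the same actions
rewards = rng.binomial(1, 0.3, size=n).astype(float)

print("IPS:", ips_value(rewards, pi_target, pi_behavior))
print("LS :", ls_value(rewards, pi_target, pi_behavior, lam=0.1))
```

As lam goes to zero the smoothed weight recovers the raw weight, while very large weights are amplified only logarithmically, which is the damping behaviour the abstract attributes to LS.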
Related papers
- OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators [13.408838970377035]
Offline policy evaluation (OPE) allows us to evaluate and estimate a new sequential decision-making policy's performance.
We propose a new algorithm that adaptively blends a set of OPE estimators for a given dataset, relying on a statistical procedure rather than an explicit selection step.
Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.
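The abstract does not spell out OPERA's re-weighting procedure, so the snippet below is only a simple stand-in under stated assumptions: it combines several OPE point estimates with inverse-MSE weights, assuming an MSE proxy (e.g. from a bootstrap) is available for each base estimator.

```python
import numpy as np

def blend_estimates(estimates, mse_proxies):
    """Blend OPE point estimates with inverse-MSE weights.
    A stand-in for OPERA's statistical re-weighting, which differs in detail."""
    estimates = np.asarray(estimates, dtype=float)
    weights = 1.0 / (np.asarray(mse_proxies, dtype=float) + 1e-12)
    weights /= weights.sum()
    return float(weights @ estimates), weights

# hypothetical values from three base estimators (e.g. IPS, DM, DR)
values = [0.42, 0.37, 0.40]
mse = [0.020, 0.004, 0.006]     # made-up bootstrap MSE proxies
blended, w = blend_estimates(values, mse)
print(f"blended value {blended:.3f} with weights {np.round(w, 2)}")
```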
arXiv Detail & Related papers (2024-05-27T23:51:20Z)
- Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all possible terms to all previous results.
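As a quick sketch of the "implicit exploration" idea mentioned above, the estimator below inflates the behaviour propensity by a constant gamma before dividing, which caps the importance weights. The constant and the toy data are assumptions for illustration, not the cited paper's exact construction or analysis.

```python
import numpy as np

def ix_value(rewards, pi_target, pi_behavior, gamma=0.05):
    """Implicit-exploration style estimate: divide by (propensity + gamma),
    trading a small downward bias for much better-behaved weights."""
    w_ix = pi_target / (pi_behavior + gamma)
    return float(np.mean(w_ix * rewards))

rng = np.random.default_rng(1)
n = 5_000
pi_behavior = rng.uniform(0.05, 1.0, size=n)
pi_target = rng.uniform(0.0, 1.0, size=n)
rewards = rng.uniform(0.0, 1.0, size=n)
print("IX estimate:", ix_value(rewards, pi_target, pi_behavior))
```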
arXiv Detail & Related papers (2023-09-27T16:42:10Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
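To illustrate selection by lower confidence bounds, here is a hedged sketch: each candidate policy gets an empirical-Bernstein-style LCB on its estimated value (assuming per-sample value estimates bounded in [0, 1]), and the candidate with the largest LCB wins. The bound, its constants, and the toy candidates are illustrative; the paper's generalized empirical Bernstein inequality and the PPL algorithm are more involved.

```python
import numpy as np

def lower_confidence_bound(samples, delta=0.05):
    """Empirical-Bernstein-style LCB on the mean of [0, 1]-bounded samples."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    mean, var = samples.mean(), samples.var(ddof=1)
    bonus = np.sqrt(2 * var * np.log(2 / delta) / n) + 7 * np.log(2 / delta) / (3 * (n - 1))
    return mean - bonus

def select_pessimistically(candidates):
    """Pick the candidate whose LCB (not point estimate) is largest."""
    lcbs = {name: lower_confidence_bound(v) for name, v in candidates.items()}
    return max(lcbs, key=lcbs.get), lcbs

rng = np.random.default_rng(2)
candidates = {
    "pi_a": rng.uniform(0.0, 1.0, size=200),    # noisier, fewer samples
    "pi_b": rng.uniform(0.3, 0.7, size=2000),   # similar mean, much tighter
}
best, lcbs = select_pessimistically(candidates)
print(best, {k: round(v, 3) for k, v in lcbs.items()})
```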
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- Constrained Policy Optimization for Controlled Self-Learning in Conversational AI Systems [18.546197100318693]
We introduce a scalable framework for supporting fine-grained exploration targets for individual domains via user-defined constraints.
We present a novel meta-gradient learning approach that is scalable and practical to address this problem.
We conduct extensive experiments using data from a real-world conversational AI on a set of realistic constraint benchmarks.
arXiv Detail & Related papers (2022-09-17T23:44:13Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
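As a toy rendering of that objective, the sketch below penalizes average reward by a plug-in estimate of the mutual information between a discrete sensitive variable and the chosen actions. The plug-in estimator and the weight beta are stand-ins; the cited paper develops a model-based estimator used inside policy-gradient updates.

```python
import numpy as np

def plug_in_mutual_information(sensitive, actions):
    """Plug-in estimate (in nats) of I(S; A) from paired discrete samples."""
    s_vals, a_vals = np.unique(sensitive), np.unique(actions)
    joint = np.zeros((len(s_vals), len(a_vals)))
    for i, s in enumerate(s_vals):
        for j, a in enumerate(a_vals):
            joint[i, j] = np.mean((sensitive == s) & (actions == a))
    ps = joint.sum(axis=1, keepdims=True)       # marginal of S
    pa = joint.sum(axis=0, keepdims=True)       # marginal of A
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (ps @ pa)[mask])))

def privacy_regularized_objective(rewards, sensitive, actions, beta=1.0):
    """Average reward minus beta times estimated information leakage."""
    return float(np.mean(rewards)) - beta * plug_in_mutual_information(sensitive, actions)

rng = np.random.default_rng(3)
sensitive = rng.integers(0, 2, size=1000)
actions = (sensitive + rng.binomial(1, 0.2, size=1000)) % 2   # actions usually reveal S
rewards = rng.uniform(size=1000)
print(privacy_regularized_objective(rewards, sensitive, actions, beta=0.5))
```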
arXiv Detail & Related papers (2020-12-30T03:22:35Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average-reward setting with a variance risk criterion.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or more logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to the Bellman optimality and evaluation back-ups, making the update more conservative, can yield much stronger guarantees.
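For intuition, here is a toy tabular sketch of a "more conservative" backup under assumed notation: a count-based penalty is subtracted in every Q-iteration update, so sparsely observed state-action pairs are valued pessimistically. The penalty form and constants are assumptions for illustration, not the paper's exact modification or guarantees.

```python
import numpy as np

def conservative_q_iteration(P, R, counts, gamma=0.95, beta=1.0, iters=200):
    """Tabular Q-iteration with a count-based pessimism penalty in each backup.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    counts: (S, A) number of logged samples per state-action pair."""
    S, A = R.shape
    Q = np.zeros((S, A))
    penalty = beta / np.sqrt(np.maximum(counts, 1))   # large where data is scarce
    for _ in range(iters):
        V = Q.max(axis=1)                             # greedy value of each next state
        Q = R - penalty + gamma * (P @ V)             # pessimistic Bellman backup
    return Q

# tiny 2-state, 2-action example with very uneven coverage
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])
counts = np.array([[500, 3], [400, 2]])               # second action barely observed
print(conservative_q_iteration(P, R, counts).round(2))
```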
arXiv Detail & Related papers (2020-07-16T09:25:54Z)
- Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation [21.703965401500913]
We propose an Expert-Supervised RL (ESRL) framework which uses uncertainty quantification for offline policy learning.
In particular, we make three contributions: 1) the method can learn safe and optimal policies through hypothesis testing, 2) ESRL allows for different levels of risk-averse implementation tailored to the application context, and 3) we propose a way to interpret ESRL's policy at every state through posterior distributions.
arXiv Detail & Related papers (2020-06-23T17:43:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.