Policy learning "without'' overlap: Pessimism and generalized empirical
Bernstein's inequality
- URL: http://arxiv.org/abs/2212.09900v1
- Date: Mon, 19 Dec 2022 22:43:08 GMT
- Title: Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality
- Authors: Ying Jin, Zhimei Ren, Zhuoran Yang, Zhaoran Wang
- Abstract summary: Offline policy learning aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics are lower bounded in the offline dataset.
We propose a new algorithm that optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values.
- Score: 107.84979976896912
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies offline policy learning, which aims at utilizing
observations collected a priori (from either fixed or adaptively evolving
behavior policies) to learn an optimal individualized decision rule that
achieves the best overall outcomes for a given population. Existing policy
learning methods rely on a uniform overlap assumption, i.e., the propensities
of exploring all actions for all individual characteristics are lower bounded
in the offline dataset; put differently, the performance of the existing
methods depends on the worst-case propensity in the offline dataset. As one has
no control over the data collection process, this assumption can be unrealistic
in many situations, especially when the behavior policies are allowed to evolve
over time with diminishing propensities for certain actions.
In this paper, we propose a new algorithm that optimizes lower confidence
bounds (LCBs) -- instead of point estimates -- of the policy values. The LCBs
are constructed using knowledge of the behavior policies for collecting the
offline data. Without assuming any uniform overlap condition, we establish a
data-dependent upper bound for the suboptimality of our algorithm, which only
depends on (i) the overlap for the optimal policy, and (ii) the complexity of
the policy class we optimize over. As an implication, for adaptively collected
data, we ensure efficient policy learning as long as the propensities for
optimal actions are lower bounded over time, while those for suboptimal ones
are allowed to diminish arbitrarily fast. In our theoretical analysis, we
develop a new self-normalized type concentration inequality for
inverse-propensity-weighting estimators, generalizing the well-known empirical
Bernstein's inequality to unbounded and non-i.i.d. data.
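To make the LCB idea concrete, here is a minimal sketch of pessimistic policy selection over a finite policy class in a contextual bandit, assuming the behavior propensities of the logged actions are known (as in the paper's setting). The function names and the Bernstein-style confidence width below are illustrative stand-ins, not the paper's generalized empirical Bernstein construction.

import numpy as np

def ipw_terms(policy, contexts, actions, rewards, propensities):
    # Per-sample inverse-propensity-weighted contributions to the value of `policy`.
    # `propensities` are the behavior-policy probabilities of the logged actions.
    chosen = np.array([policy(x) for x in contexts])
    match = (chosen == actions).astype(float)
    return match * rewards / propensities

def lcb_value(policy, contexts, actions, rewards, propensities, delta=0.05):
    # Lower confidence bound on the policy value; the width is a generic
    # empirical-Bernstein-style term used here purely for illustration.
    terms = ipw_terms(policy, contexts, actions, rewards, propensities)
    n = len(terms)
    mean, var = terms.mean(), terms.var(ddof=1)
    b = terms.max()  # crude bound on the individual IPW terms
    width = np.sqrt(2.0 * var * np.log(1.0 / delta) / n) + 3.0 * b * np.log(1.0 / delta) / n
    return mean - width

def pessimistic_policy(policy_class, contexts, actions, rewards, propensities, delta=0.05):
    # Pick the policy whose LCB, not whose point estimate, is largest.
    return max(policy_class,
               key=lambda pi: lcb_value(pi, contexts, actions, rewards, propensities, delta))

The point of optimizing the LCB rather than the point estimate is that policies relying on poorly explored actions receive wide intervals and therefore low LCBs, so the suboptimality of the selected policy is governed by the overlap of the optimal policy rather than by the worst-case propensity in the dataset.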
Related papers
- Bi-Level Offline Policy Optimization with Limited Exploration [1.8130068086063336]
We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset.
We propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level)
We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2023-10-10T02:45:50Z) - Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all possible terms to all previous results.
arXiv Detail & Related papers (2023-09-27T16:42:10Z) - Iteratively Refined Behavior Regularization for Offline Reinforcement
Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z) - Offline Imitation Learning with Suboptimal Demonstrations via Relaxed
Distribution Matching [109.5084863685397]
Offline imitation learning (IL) promises the ability to learn performant policies from pre-collected demonstrations without interacting with the environment.
We present RelaxDICE, which employs an asymmetrically-relaxed f-divergence for explicit support regularization.
Our method significantly outperforms the best prior offline method in six standard continuous control environments.
arXiv Detail & Related papers (2023-03-05T03:35:11Z) - Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design [18.326126953667842]
We propose novel methods that improve the data efficiency of online Monte Carlo estimators.
We first propose a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator.
We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data.
arXiv Detail & Related papers (2023-01-31T16:12:31Z) - Offline Reinforcement Learning with Closed-Form Policy Improvement
Operators [88.54210578912554]
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z) - Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium
Learning from Offline Datasets [101.5329678997916]
We study episodic two-player zero-sum Markov games (MGs) in the offline setting.
The goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori.
arXiv Detail & Related papers (2022-02-15T15:39:30Z) - Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
arXiv Detail & Related papers (2020-12-30T09:06:57Z)