Policy learning "without'' overlap: Pessimism and generalized empirical
Bernstein's inequality
- URL: http://arxiv.org/abs/2212.09900v1
- Date: Mon, 19 Dec 2022 22:43:08 GMT
- Title: Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality
- Authors: Ying Jin, Zhimei Ren, Zhuoran Yang, Zhaoran Wang
- Abstract summary: Offline policy learning aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics are lower bounded in the offline dataset.
We propose a new algorithm that optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values.
- Score: 107.84979976896912
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies offline policy learning, which aims at utilizing
observations collected a priori (from either fixed or adaptively evolving
behavior policies) to learn an optimal individualized decision rule that
achieves the best overall outcomes for a given population. Existing policy
learning methods rely on a uniform overlap assumption, i.e., the propensities
of exploring all actions for all individual characteristics are lower bounded
in the offline dataset; put differently, the performance of the existing
methods depends on the worst-case propensity in the offline dataset. As one has
no control over the data collection process, this assumption can be unrealistic
in many situations, especially when the behavior policies are allowed to evolve
over time with diminishing propensities for certain actions.
In this paper, we propose a new algorithm that optimizes lower confidence
bounds (LCBs) -- instead of point estimates -- of the policy values. The LCBs
are constructed using knowledge of the behavior policies for collecting the
offline data. Without assuming any uniform overlap condition, we establish a
data-dependent upper bound for the suboptimality of our algorithm, which only
depends on (i) the overlap for the optimal policy, and (ii) the complexity of
the policy class we optimize over. As an implication, for adaptively collected
data, we ensure efficient policy learning as long as the propensities for
optimal actions are lower bounded over time, while those for suboptimal ones
are allowed to diminish arbitrarily fast. In our theoretical analysis, we
develop a new self-normalized type concentration inequality for
inverse-propensity-weighting estimators, generalizing the well-known empirical
Bernstein's inequality to unbounded and non-i.i.d. data.
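To make the LCB idea concrete, here is a minimal sketch of pessimistic policy selection over a finite policy class in a contextual bandit, assuming the behavior propensities of the logged actions are known (as in the paper's setting). The function names and the Bernstein-style confidence width below are illustrative stand-ins, not the paper's generalized empirical Bernstein construction.

import numpy as np

def ipw_terms(policy, contexts, actions, rewards, propensities):
    # Per-sample inverse-propensity-weighted contributions to the value of `policy`.
    # `propensities` are the behavior-policy probabilities of the logged actions.
    chosen = np.array([policy(x) for x in contexts])
    match = (chosen == actions).astype(float)
    return match * rewards / propensities

def lcb_value(policy, contexts, actions, rewards, propensities, delta=0.05):
    # Lower confidence bound on the policy value; the width is a generic
    # empirical-Bernstein-style term used here purely for illustration.
    terms = ipw_terms(policy, contexts, actions, rewards, propensities)
    n = len(terms)
    mean, var = terms.mean(), terms.var(ddof=1)
    b = terms.max()  # crude bound on the individual IPW terms
    width = np.sqrt(2.0 * var * np.log(1.0 / delta) / n) + 3.0 * b * np.log(1.0 / delta) / n
    return mean - width

def pessimistic_policy(policy_class, contexts, actions, rewards, propensities, delta=0.05):
    # Pick the policy whose LCB, not whose point estimate, is largest.
    return max(policy_class,
               key=lambda pi: lcb_value(pi, contexts, actions, rewards, propensities, delta))

The point of optimizing the LCB rather than the point estimate is that policies relying on poorly explored actions receive wide intervals and therefore low LCBs, so the suboptimality of the selected policy is governed by the overlap of the optimal policy rather than by the worst-case propensity in the dataset.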
Related papers
- Bi-Level Offline Policy Optimization with Limited Exploration [1.8130068086063336]
We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset.
We propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level)
We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2023-10-10T02:45:50Z) - Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all possible terms to all previous results.
arXiv Detail & Related papers (2023-09-27T16:42:10Z) - Iteratively Refined Behavior Regularization for Offline Reinforcement
Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z) - Offline Imitation Learning with Suboptimal Demonstrations via Relaxed
Distribution Matching [109.5084863685397]
Offline imitation learning (IL) promises the ability to learn performant policies from pre-collected demonstrations without interacting with the environment.
We present RelaxDICE, which employs an asymmetrically-relaxed f-divergence for explicit support regularization.
Our method significantly outperforms the best prior offline method in six standard continuous control environments.
arXiv Detail & Related papers (2023-03-05T03:35:11Z) - Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design [18.326126953667842]
We propose novel methods that improve the data efficiency of online Monte Carlo estimators.
We first propose a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator.
We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data.
arXiv Detail & Related papers (2023-01-31T16:12:31Z) - Offline Reinforcement Learning with Closed-Form Policy Improvement
Operators [88.54210578912554]
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z) - Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium
Learning from Offline Datasets [101.5329678997916]
We study episodic two-player zero-sum Markov games (MGs) in the offline setting.
The goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori.
arXiv Detail & Related papers (2022-02-15T15:39:30Z) - Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
arXiv Detail & Related papers (2020-12-30T09:06:57Z)