Bi-Level Offline Policy Optimization with Limited Exploration
- URL: http://arxiv.org/abs/2310.06268v1
- Date: Tue, 10 Oct 2023 02:45:50 GMT
- Title: Bi-Level Offline Policy Optimization with Limited Exploration
- Authors: Wenzhuo Zhou
- Abstract summary: We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset.
We propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level).
We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
- Score: 1.8130068086063336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study offline reinforcement learning (RL) which seeks to learn a good
policy based on a fixed, pre-collected dataset. A fundamental challenge behind
this task is the distributional shift due to the dataset lacking sufficient
exploration, especially under function approximation. To tackle this issue, we
propose a bi-level structured policy optimization algorithm that models a
hierarchical interaction between the policy (upper-level) and the value
function (lower-level). The lower level focuses on constructing a confidence
set of value estimates that maintain sufficiently small weighted average
Bellman errors, while controlling uncertainty arising from distribution
mismatch. Subsequently, at the upper level, the policy aims to maximize a
conservative value estimate from the confidence set formed at the lower level.
This novel formulation preserves the maximum flexibility of the implicitly
induced exploratory data distribution, enabling the power of model
extrapolation. In practice, it can be solved through a computationally
efficient, penalized adversarial estimation procedure. Our theoretical regret
guarantees do not rely on any data-coverage and completeness-type assumptions,
only requiring realizability. These guarantees also demonstrate that the
learned policy represents the "best effort" among all policies, as no other
policies can outperform it. We evaluate our model using a blend of synthetic,
benchmark, and real-world datasets for offline RL, showing that it performs
competitively with state-of-the-art methods.
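As a schematic illustration of the bi-level structure described above (the notation here is our own shorthand and may not match the paper's exact formulation), the lower level retains only value functions whose weighted average Bellman errors on the data are small, and the upper level maximizes the most pessimistic value in that set:

```latex
\max_{\pi}\; \min_{Q \in \mathcal{C}(\pi)} \; \mathbb{E}_{s \sim d_0}\bigl[ Q(s, \pi(s)) \bigr],
\quad \text{where } \;
\mathcal{C}(\pi) = \Bigl\{ Q : \bigl|\, \mathbb{E}_{\mathcal{D}}\bigl[ w(s,a)\,\bigl( Q(s,a) - r - \gamma\, Q(s', \pi(s')) \bigr) \bigr] \,\bigr| \le \varepsilon
\;\; \text{for all } w \in \mathcal{W} \Bigr\}.
```

The penalized adversarial estimation mentioned in the abstract can then be read as replacing the hard constraint defining $\mathcal{C}(\pi)$ with a penalty on the adversarially chosen weight function $w$, yielding a single max-min problem over $(\pi, Q, w)$.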
Related papers
- Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all possible terms to all previous results.
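As a rough illustration of the implicit-exploration idea (a minimal sketch under our own simplifying assumptions, not code from the paper): the propensity in the importance-weighted estimate is inflated by a small constant, which introduces a slight bias but keeps the importance weights, and hence the variance, bounded.

```python
def ix_value_estimate(logged_rounds, target_probs, gamma=0.1):
    """Estimate a target policy's value with implicit-exploration (IX) weights.

    logged_rounds: list of (action, reward, logging_probs) tuples, where
        logging_probs is the behavior policy's distribution over actions
    target_probs: target-policy action probabilities (assumed context-free
        here purely to keep the sketch short)
    gamma: IX smoothing constant; gamma=0 recovers vanilla importance weighting
    """
    total = 0.0
    for action, reward, logging_probs in logged_rounds:
        # Inflating the denominator by gamma trades a small downward bias
        # for importance weights that can never blow up.
        weight = target_probs[action] / (logging_probs[action] + gamma)
        total += weight * reward
    return total / max(len(logged_rounds), 1)
```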
arXiv Detail & Related papers (2023-09-27T16:42:10Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
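To make the pessimism principle above concrete, here is a minimal, illustrative sketch (the variable names and the way the widths are obtained are assumptions, not the paper's construction): each candidate policy gets a lower confidence bound, and the learner picks the policy whose LCB is largest.

```python
import numpy as np

def select_pessimistic_policy(value_estimates, uncertainty_widths):
    """Pick the policy maximizing a lower confidence bound (LCB).

    value_estimates: point estimates of each candidate policy's value
    uncertainty_widths: confidence widths (e.g. Bernstein-style terms)
    Returns the index of the policy with the largest pessimistic value.
    """
    lcbs = np.asarray(value_estimates) - np.asarray(uncertainty_widths)
    return int(np.argmax(lcbs))

# Example: the point-estimate winner (policy 1) is passed over because its
# estimate is poorly supported by the data, while policy 0 has a tight bound.
values = [0.62, 0.70, 0.55]
widths = [0.05, 0.25, 0.04]
print(select_pessimistic_policy(values, widths))  # -> 0
```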
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets [101.5329678997916]
We study episodic two-player zero-sum Markov games (MGs) in the offline setting.
The goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori.
arXiv Detail & Related papers (2022-02-15T15:39:30Z)
- Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
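A minimal sketch of importance-sampling weighted ERM on adaptively collected data (the interface is hypothetical; the paper's estimator and weighting scheme may differ): each logged sample is reweighted by the ratio of the evaluation probability to the probability the adaptive logging policy actually used, and the hypothesis with the smallest weighted average loss is returned.

```python
import numpy as np

def weighted_erm(losses_by_hypothesis, target_probs, logging_probs):
    """Importance-sampling weighted empirical risk minimization.

    losses_by_hypothesis: array [n_hypotheses, n_samples] of per-sample losses
    target_probs: probability the evaluation distribution assigns to each
        logged action, shape [n_samples]
    logging_probs: probability the adaptive data-collection policy assigned
        to each logged action at collection time, shape [n_samples]
    Returns the index of the hypothesis minimizing the weighted average loss.
    """
    weights = np.asarray(target_probs) / np.asarray(logging_probs)
    weighted_risks = (np.asarray(losses_by_hypothesis) * weights).mean(axis=1)
    return int(np.argmin(weighted_risks))
```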
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
- Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds [21.520045697447372]
Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies.
This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation.
We develop a practical algorithm through a primal-dual optimization-based approach.
arXiv Detail & Related papers (2021-03-09T22:31:20Z)
- Reducing Conservativeness Oriented Offline Reinforcement Learning [29.895142928565228]
In offline reinforcement learning, a policy learns to maximize cumulative rewards with a fixed collection of data.
We propose a reducing-conservativeness-oriented offline reinforcement learning method.
Our proposed method is able to tackle the skewed distribution of the provided dataset and derive a value function closer to the expected value function.
arXiv Detail & Related papers (2021-02-27T01:21:01Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
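Schematically, the alternating update pattern described above can be sketched as follows (a toy scalar-parameter sketch under our own assumptions about the Lagrangian; step sizes, gradients, and the constraint term are placeholders rather than the paper's exact updates):

```python
def constrained_actor_critic_step(policy, lagrange_mult, fenchel_dual,
                                  reward_grad, variance_grad, dual_grad,
                                  variance_slack, lr=1e-3):
    """One illustrative round of alternating updates on the three variables."""
    # Actor: ascend the Lagrangian (reward minus penalized variance) in the
    # policy parameter.
    policy = policy + lr * (reward_grad - lagrange_mult * variance_grad)
    # Lagrange multiplier: ascend on the constraint violation, stay non-negative.
    lagrange_mult = max(0.0, lagrange_mult + lr * variance_slack)
    # Fenchel dual variable: ascend its own concave objective.
    fenchel_dual = fenchel_dual + lr * dual_grad
    return policy, lagrange_mult, fenchel_dual
```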
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Distributionally Robust Batch Contextual Bandits [20.667213458836734]
Policy learning using historical observational data is an important problem that has found widespread applications.
Existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment.
In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data.
arXiv Detail & Related papers (2020-06-10T03:11:40Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
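For context, a standard way to attack off-policy evaluation with linear function approximation is an LSTD-style solve of the projected Bellman equation from logged transitions; the sketch below is a generic illustration of that setup, not the specific minimax-optimal estimator analyzed in the paper.

```python
import numpy as np

def lstd_off_policy_value(transitions, features, target_policy, gamma, init_state_dist):
    """LSTD-style off-policy evaluation with linear features (illustrative).

    transitions: list of (s, a, r, s_next) index tuples from the logged history
    features: array [n_states, n_actions, d] of state-action feature vectors
    target_policy: array [n_states, n_actions] of target-policy probabilities
    gamma: discount factor
    init_state_dist: array [n_states] initial-state distribution
    Returns an estimate of the target policy's expected discounted return.
    """
    d = features.shape[-1]
    A = np.zeros((d, d))
    b = np.zeros(d)
    for s, a, r, s_next in transitions:
        phi = features[s, a]
        # Expected next feature under the *target* policy, not the logger.
        phi_next = target_policy[s_next] @ features[s_next]
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    theta = np.linalg.solve(A + 1e-6 * np.eye(d), b)  # small ridge for stability
    # Value of the initial state distribution under the target policy.
    init_features = np.einsum('s,sa,sad->d', init_state_dist, target_policy, features)
    return float(init_features @ theta)
```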
arXiv Detail & Related papers (2020-02-21T19:20:57Z)