Offline Reinforcement Learning with Additional Covering Distributions
- URL: http://arxiv.org/abs/2305.12679v2
- Date: Wed, 24 May 2023 13:25:31 GMT
- Title: Offline Reinforcement Learning with Additional Covering Distributions
- Authors: Chenjie Mao
- Abstract summary: We study learning optimal policies from a logged dataset, i.e., offline RL, with function approximation.
We show that sample-efficient offline RL for general MDPs is possible with only a partial-coverage dataset and weakly realizable function classes.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study learning optimal policies from a logged dataset, i.e., offline RL,
with function approximation. Despite the effort devoted to this problem, existing
algorithms with theoretical finite-sample guarantees typically assume exploratory
data coverage or strongly realizable function classes, assumptions that are hard to
satisfy in practice. Recent works that relax these strong assumptions either require
gap conditions that only some MDPs satisfy, or resort to behavior regularization,
which makes the optimality of the learned policy intractable. To address this
challenge, we provide finite-sample guarantees for a simple algorithm based on
marginalized importance sampling (MIS), showing that sample-efficient offline RL for
general MDPs is possible with only a partial-coverage dataset and weakly realizable
function classes, given additional side information in the form of a covering
distribution. Furthermore, we show that the covering distribution trades off prior
knowledge of the optimal trajectories against the coverage requirement on the
dataset, revealing the effect of this inductive bias on the learning process.
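As a rough, non-authoritative illustration of the marginalized importance sampling (MIS) idea behind the algorithm, below is a minimal tabular sketch in Python. It assumes the MIS weights w(s, a), approximating d^pi(s, a) / d^D(s, a), have already been obtained, whereas the paper learns them from a weakly realizable class with the help of the covering distribution; the function names and the toy setting are assumptions, not the paper's implementation.

```python
import numpy as np

def mis_value_estimate(dataset, weights, gamma=0.99):
    """Estimate J(pi) ~= (1 / (1 - gamma)) * E_data[ w(s, a) * r ].

    dataset : iterable of (state, action, reward) tuples logged under the
              behavior policy.
    weights : array w[s, a] approximating the marginalized importance ratio
              d^pi(s, a) / d^D(s, a).  In the paper this ratio is learned from
              a weakly realizable class with the aid of the covering
              distribution; here it is simply taken as given.
    """
    ratios = np.array([weights[s, a] for s, a, _ in dataset])
    rewards = np.array([r for _, _, r in dataset])
    return (ratios * rewards).mean() / (1.0 - gamma)

if __name__ == "__main__":
    # Toy usage with random weights and data (purely illustrative).
    rng = np.random.default_rng(0)
    w = rng.uniform(0.5, 2.0, size=(5, 2))
    data = [(rng.integers(5), rng.integers(2), rng.random()) for _ in range(1000)]
    print(mis_value_estimate(data, w, gamma=0.9))
```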
Related papers
- Offline RL via Feature-Occupancy Gradient Ascent [9.983014605039658]
We study offline reinforcement learning in large infinite-horizon discounted Markov Decision Processes (MDPs).
We develop a new algorithm that performs a form of gradient ascent in the space of feature occupancies.
We show that the resulting simple algorithm satisfies strong computational and sample complexity guarantees.
arXiv Detail & Related papers (2024-05-22T15:39:05Z)
- Pessimistic Causal Reinforcement Learning with Mediators for Confounded Offline Data [17.991833729722288]
We propose a novel policy learning algorithm, PESsimistic CAusal Learning (PESCAL)
Our key observation is that, by incorporating auxiliary variables that mediate the effect of actions on system dynamics, it is sufficient to learn a lower bound of the mediator distribution function, instead of the Q-function.
We provide theoretical guarantees for the algorithms we propose, and demonstrate their efficacy through simulations, as well as real-world experiments utilizing offline datasets from a leading ride-hailing platform.
arXiv Detail & Related papers (2024-03-18T14:51:19Z)
- Offline Policy Evaluation for Reinforcement Learning with Adaptively Collected Data [28.445166861907495]
We develop theory for the TMIS Offline Policy Evaluation (OPE) estimator (a toy tabular sketch of such an estimator appears after this list).
We derive high-probability, instance-dependent bounds on its estimation error.
We also recover minimax-optimal offline learning in the adaptive setting.
arXiv Detail & Related papers (2023-06-24T21:48:28Z)
- Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z)
- Optimal Conservative Offline RL with General Function Approximation via Augmented Lagrangian [18.2080757218886]
Offline reinforcement learning (RL) refers to decision-making from a previously collected dataset of interactions.
We present the first set of offline RL algorithms that are statistically optimal and practical under general function approximation and single-policy concentrability.
arXiv Detail & Related papers (2022-11-01T19:28:48Z)
- Offline Reinforcement Learning with Instrumental Variables in Confounded Markov Decision Processes [93.61202366677526]
We study offline reinforcement learning (RL) in the face of unmeasured confounders.
We propose various policy learning methods with finite-sample suboptimality guarantees for finding the optimal in-class policy.
arXiv Detail & Related papers (2022-09-18T22:03:55Z)
- Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z)
- Offline Reinforcement Learning Under Value and Density-Ratio Realizability: the Power of Gaps [15.277483173402128]
We provide guarantees for a pessimistic algorithm based on a version space formed by marginalized importance sampling.
Our work is the first to identify the utility and the novel mechanism of gap assumptions in offline reinforcement learning.
arXiv Detail & Related papers (2022-03-25T23:33:38Z)
- Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function (see the sketch after this list).
arXiv Detail & Related papers (2020-12-30T09:06:57Z)
- What are the Statistical Limits of Offline RL with Linear Function Approximation? [70.33301077240763]
Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of sequential decision-making strategies.
This work focuses on the basic question of which representational and distributional conditions are necessary to permit provably sample-efficient offline reinforcement learning.
arXiv Detail & Related papers (2020-10-22T17:32:13Z)
- Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how, under a more standard notion of low inherent Bellman error, typically employed in least-squares value-iteration-style algorithms, one can obtain strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near-optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
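The following is a toy, finite-horizon tabular sketch in the spirit of the TMIS off-policy evaluation estimator mentioned in "Offline Policy Evaluation for Reinforcement Learning with Adaptively Collected Data" above: build an empirical model from the logged episodes, roll the target policy's marginal state-action distributions forward through it, and sum the resulting reward estimates. The data layout, names, and the assumption of fixed-length episodes are ours; none of the paper's adaptive-data analysis or bounds are reflected here.

```python
import numpy as np

def tmis_ope(episodes, pi, S, A, H):
    """Toy marginalized importance sampling OPE via an empirical model.

    episodes : list of fixed-length trajectories [(s_0, a_0, r_0), ..., (s_{H-1}, a_{H-1}, r_{H-1})]
    pi       : target policy, array of shape (H, S, A) with pi[h, s, a] = pi_h(a | s)
    """
    # Empirical transition counts, reward sums, and visit counts per step h.
    P = np.zeros((H, S, A, S))
    R = np.zeros((H, S, A))
    N = np.zeros((H, S, A))
    d0 = np.zeros(S)
    for ep in episodes:
        d0[ep[0][0]] += 1.0
        for h, (s, a, r) in enumerate(ep):
            N[h, s, a] += 1.0
            R[h, s, a] += r
            if h + 1 < H:
                P[h, s, a, ep[h + 1][0]] += 1.0
    d0 /= len(episodes)
    R = R / np.maximum(N, 1.0)                 # empirical mean rewards
    P = P / np.maximum(N, 1.0)[..., None]      # empirical transition kernel
    # (Unvisited state-action pairs keep zero transition mass; fine for a toy sketch.)

    # Roll the target policy forward through the empirical model and
    # accumulate the value estimate sum_h <d_h^pi(s, a), r_hat_h(s, a)>.
    d = d0.copy()
    value = 0.0
    for h in range(H):
        d_sa = d[:, None] * pi[h]              # estimated d_h^pi over (s, a)
        value += (d_sa * R[h]).sum()
        d = np.einsum('sa,saz->z', d_sa, P[h]) # push forward to step h + 1
    return value
```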
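And here is a tabular, finite-horizon sketch of pessimistic value iteration in the spirit of PEVI from "Is Pessimism Provably Efficient for Offline RL?" above. The count-based penalty beta / sqrt(n(s, a)) stands in for the paper's uncertainty quantifier (their analysis uses linear function approximation), so this is an illustration of the general recipe, not the paper's algorithm.

```python
import numpy as np

def pevi_tabular(P_hat, R_hat, counts, beta=1.0):
    """Pessimistic value iteration on an empirical model.

    P_hat  : (H, S, A, S) empirical transition probabilities
    R_hat  : (H, S, A) empirical mean rewards in [0, 1]
    counts : (H, S, A) visit counts from the offline dataset
    beta   : scale of the count-based uncertainty penalty (an assumption here)
    """
    H, S, A, _ = P_hat.shape
    V = np.zeros(S)                           # V_{H+1} = 0
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        penalty = beta / np.sqrt(np.maximum(counts[h], 1.0))
        Q = R_hat[h] + P_hat[h] @ V - penalty # pessimistic Q-estimate at step h
        Q = np.clip(Q, 0.0, H - h)            # keep values in the feasible range
        policy[h] = Q.argmax(axis=1)          # greedy w.r.t. the pessimistic Q
        V = Q.max(axis=1)
    return policy, V                          # V approximates the pessimistic V_1
```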