Active Coverage for PAC Reinforcement Learning
- URL: http://arxiv.org/abs/2306.13601v1
- Date: Fri, 23 Jun 2023 16:39:37 GMT
- Title: Active Coverage for PAC Reinforcement Learning
- Authors: Aymen Al-Marjani, Andrea Tirinzoni, Emilie Kaufmann
- Abstract summary: This paper formalizes the problem of active coverage in episodic Markov decision processes (MDPs)
We show that CovGame can be used as a building block to solve different PAC RL tasks.
- Score: 24.256960622176305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Collecting and leveraging data with good coverage properties plays a crucial
role in different aspects of reinforcement learning (RL), including reward-free
exploration and offline learning. However, the notion of "good coverage" really
depends on the application at hand, as data suitable for one context may not be
so for another. In this paper, we formalize the problem of active coverage in
episodic Markov decision processes (MDPs), where the goal is to interact with
the environment so as to fulfill given sampling requirements. This framework is
sufficiently flexible to specify any desired coverage property, making it
applicable to any problem that involves online exploration. Our main
contribution is an instance-dependent lower bound on the sample complexity of
active coverage and a simple game-theoretic algorithm, CovGame, that nearly
matches it. We then show that CovGame can be used as a building block to solve
different PAC RL tasks. In particular, we obtain a simple algorithm for PAC
reward-free exploration with an instance-dependent sample complexity that, in
certain MDPs which are "easy to explore", is lower than the minimax one. By
further coupling this exploration algorithm with a new technique to do implicit
eliminations in policy space, we obtain a computationally-efficient algorithm
for best-policy identification whose instance-dependent sample complexity
scales with gaps between policy values.
Related papers
- Offline RL via Feature-Occupancy Gradient Ascent [9.983014605039658]
We study offline Reinforcement Learning in large infinite-horizon discounted Markov Decision Processes (MDPs)
We develop a new algorithm that performs a form of gradient ascent in the space of feature occupancies.
We show that the resulting simple algorithm satisfies strong computational and sample complexity guarantees.
arXiv Detail & Related papers (2024-05-22T15:39:05Z) - Offline Imitation Learning from Multiple Baselines with Applications to Compiler Optimization [17.729842629392742]
We study a Reinforcement Learning problem in which we are given a set of trajectories collected with K baseline policies.
The goal is to learn a policy which performs as well as the best combination of baselines on the entire state space.
arXiv Detail & Related papers (2024-03-28T14:34:02Z) - Nearly Optimal Latent State Decoding in Block MDPs [74.51224067640717]
In episodic Block MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states.
We are first interested in estimating the latent state decoding function based on data generated under a fixed behavior policy.
We then study the problem of learning near-optimal policies in the reward-free framework.
arXiv Detail & Related papers (2022-08-17T18:49:53Z) - Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via
Online Experiment Design [12.056495277232118]
This work seeks to understand the "instance-dependent" complexity of learning near-optimal policies.
We propose an algorithm, textscPedel, which achieves a fine-grained instance-dependent measure of complexity.
We show that textscPedel yields provable gains over low-regret, minimax-optimal algorithms.
arXiv Detail & Related papers (2022-07-06T10:42:57Z) - Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs [24.256960622176305]
We propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in episodic Markov decision processes.
Our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap.
Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts.
arXiv Detail & Related papers (2022-03-17T11:19:41Z) - Settling the Horizon-Dependence of Sample Complexity in Reinforcement
Learning [82.31436758872715]
We develop an algorithm that achieves the same PAC guarantee while using only $O(1)$ episodes of environment interactions.
We establish a connection between value functions in discounted and finite-horizon Markov decision processes.
arXiv Detail & Related papers (2021-11-01T00:21:24Z) - On Reward-Free RL with Kernel and Neural Function Approximations:
Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z) - Improved Algorithms for Agnostic Pool-based Active Classification [20.12178157010804]
We consider active learning for binary classification in the agnostic pool-based setting.
Our algorithm is superior to state of the art active learning algorithms on image classification datasets.
arXiv Detail & Related papers (2021-05-13T18:24:30Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Provably Efficient Reward-Agnostic Navigation with Linear Value
Iteration [143.43658264904863]
We show how iteration under a more standard notion of low inherent Bellman error, typically employed in least-square value-style algorithms, can provide strong PAC guarantees on learning a near optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z) - A Provably Efficient Sample Collection Strategy for Reinforcement
Learning [123.69175280309226]
One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior.
We propose to tackle the exploration-exploitation problem following a decoupled approach composed of: 1) An "objective-specific" algorithm that prescribes how many samples to collect at which states, as if it has access to a generative model (i.e., sparse simulator of the environment); 2) An "objective-agnostic" sample collection responsible for generating the prescribed samples as fast as possible.
arXiv Detail & Related papers (2020-07-13T15:17:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.