Related papers: Active Coverage for PAC Reinforcement Learning

Active Coverage for PAC Reinforcement Learning

URL: http://arxiv.org/abs/2306.13601v1
Date: Fri, 23 Jun 2023 16:39:37 GMT
Title: Active Coverage for PAC Reinforcement Learning
Authors: Aymen Al-Marjani, Andrea Tirinzoni, Emilie Kaufmann
Abstract summary: This paper formalizes the problem of active coverage in episodic Markov decision processes (MDPs) We show that CovGame can be used as a building block to solve different PAC RL tasks.
Score: 24.256960622176305
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Collecting and leveraging data with good coverage properties plays a crucial role in different aspects of reinforcement learning (RL), including reward-free exploration and offline learning. However, the notion of "good coverage" really depends on the application at hand, as data suitable for one context may not be so for another. In this paper, we formalize the problem of active coverage in episodic Markov decision processes (MDPs), where the goal is to interact with the environment so as to fulfill given sampling requirements. This framework is sufficiently flexible to specify any desired coverage property, making it applicable to any problem that involves online exploration. Our main contribution is an instance-dependent lower bound on the sample complexity of active coverage and a simple game-theoretic algorithm, CovGame, that nearly matches it. We then show that CovGame can be used as a building block to solve different PAC RL tasks. In particular, we obtain a simple algorithm for PAC reward-free exploration with an instance-dependent sample complexity that, in certain MDPs which are "easy to explore", is lower than the minimax one. By further coupling this exploration algorithm with a new technique to do implicit eliminations in policy space, we obtain a computationally-efficient algorithm for best-policy identification whose instance-dependent sample complexity scales with gaps between policy values.

Related papers

Offline RL via Feature-Occupancy Gradient Ascent [9.983014605039658]
We study offline Reinforcement Learning in large infinite-horizon discounted Markov Decision Processes (MDPs) We develop a new algorithm that performs a form of gradient ascent in the space of feature occupancies. We show that the resulting simple algorithm satisfies strong computational and sample complexity guarantees.
arXiv Detail & Related papers (2024-05-22T15:39:05Z)
Offline Imitation Learning from Multiple Baselines with Applications to Compiler Optimization [17.729842629392742]
We study a Reinforcement Learning problem in which we are given a set of trajectories collected with K baseline policies. The goal is to learn a policy which performs as well as the best combination of baselines on the entire state space.
arXiv Detail & Related papers (2024-03-28T14:34:02Z)
Nearly Optimal Latent State Decoding in Block MDPs [74.51224067640717]
In episodic Block MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states. We are first interested in estimating the latent state decoding function based on data generated under a fixed behavior policy. We then study the problem of learning near-optimal policies in the reward-free framework.
arXiv Detail & Related papers (2022-08-17T18:49:53Z)
Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design [12.056495277232118]
This work seeks to understand the "instance-dependent" complexity of learning near-optimal policies. We propose an algorithm, textscPedel, which achieves a fine-grained instance-dependent measure of complexity. We show that textscPedel yields provable gains over low-regret, minimax-optimal algorithms.
arXiv Detail & Related papers (2022-07-06T10:42:57Z)
Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs [24.256960622176305]
We propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in episodic Markov decision processes. Our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts.
arXiv Detail & Related papers (2022-03-17T11:19:41Z)
Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning [82.31436758872715]
We develop an algorithm that achieves the same PAC guarantee while using only $O(1)$ episodes of environment interactions. We establish a connection between value functions in discounted and finite-horizon Markov decision processes.
arXiv Detail & Related papers (2021-11-01T00:21:24Z)
On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function. We tackle this problem under the context of function approximation, leveraging powerful function approximators. We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z)
Improved Algorithms for Agnostic Pool-based Active Classification [20.12178157010804]
We consider active learning for binary classification in the agnostic pool-based setting. Our algorithm is superior to state of the art active learning algorithms on image classification datasets.
arXiv Detail & Related papers (2021-05-13T18:24:30Z)
Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision (MDPs) when the learner has access to a generative model. The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how iteration under a more standard notion of low inherent Bellman error, typically employed in least-square value-style algorithms, can provide strong PAC guarantees on learning a near optimal value function. We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
A Provably Efficient Sample Collection Strategy for Reinforcement Learning [123.69175280309226]
One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior. We propose to tackle the exploration-exploitation problem following a decoupled approach composed of: 1) An "objective-specific" algorithm that prescribes how many samples to collect at which states, as if it has access to a generative model (i.e., sparse simulator of the environment); 2) An "objective-agnostic" sample collection responsible for generating the prescribed samples as fast as possible.
arXiv Detail & Related papers (2020-07-13T15:17:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.