Agnostic Reinforcement Learning: Foundations and Algorithms
- URL: http://arxiv.org/abs/2506.01884v1
- Date: Mon, 02 Jun 2025 17:12:24 GMT
- Title: Agnostic Reinforcement Learning: Foundations and Algorithms
- Authors: Gene Li
- Abstract summary: This thesis rigorously examines the statistical complexity of RL with function approximation from a learning theoretic perspective. We explore agnostic policy learning, in which the learner seeks to find the best policy in a given class $\Pi$, with no guarantee that $\Pi$ contains an optimal policy for the underlying task. Within this comprehensive framework, we design new learning algorithms with theoretical guarantees and characterize fundamental performance bounds of any algorithm.
- Score: 4.07926531936425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning (RL) has demonstrated tremendous empirical success across numerous challenging domains. However, we lack a strong theoretical understanding of the statistical complexity of RL in environments with large state spaces, where function approximation is required for sample-efficient learning. This thesis addresses this gap by rigorously examining the statistical complexity of RL with function approximation from a learning theoretic perspective. Departing from a long history of prior work, we consider the weakest form of function approximation, called agnostic policy learning, in which the learner seeks to find the best policy in a given class $\Pi$, with no guarantee that $\Pi$ contains an optimal policy for the underlying task. We systematically explore agnostic policy learning along three key axes: environment access -- how a learner collects data from the environment; coverage conditions -- intrinsic properties of the underlying MDP measuring the expansiveness of state-occupancy measures for policies in the class $\Pi$; and representational conditions -- structural assumptions on the class $\Pi$ itself. Within this comprehensive framework, we (1) design new learning algorithms with theoretical guarantees and (2) characterize fundamental performance bounds of any algorithm. Our results reveal significant statistical separations that highlight the power and limitations of agnostic policy learning.
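For concreteness, the learning goal can be stated compactly; the notation below is standard for this setting but paraphrased rather than quoted from the thesis, with $J(\pi)$ denoting the expected return of policy $\pi$ in the underlying MDP. An agnostic learner must output a policy $\hat{\pi}$ satisfying

$$J(\hat{\pi}) \;\ge\; \max_{\pi \in \Pi} J(\pi) - \epsilon,$$

with no assumption that an optimal policy $\pi^\star$ belongs to $\Pi$; the question throughout is how many samples this requires under the three axes above.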
Related papers
- Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning [66.4260157478436]
We study reinforcement learning in the agnostic policy learning setting. The goal is to find a policy whose performance is competitive with the best policy in a given class of interest.
arXiv Detail & Related papers (2025-07-06T14:40:05Z)
- Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits
Reinforcement learning with outcome-based feedback faces a fundamental challenge: how do we assign credit to the right actions? This paper provides the first comprehensive analysis of this problem in online RL with general function approximation.
arXiv Detail & Related papers (2025-05-26T17:44:08Z)
- The Role of Environment Access in Agnostic Reinforcement Learning
We study Reinforcement Learning (RL) in environments with large state spaces. We consider the weakest possible form of function approximation, called agnostic policy learning. We show that sample-efficient policy learning is not possible in the standard online RL setting.
arXiv Detail & Related papers (2025-04-07T18:19:56Z)
- Scalable Online Exploration via Coverability
Exploration is a major challenge in reinforcement learning, especially for high-dimensional domains that require function approximation.
We introduce a new objective, $L_1$-Coverage, which generalizes previous exploration schemes and supports three fundamental desiderata.
$L_1$-Coverage enables the first computationally efficient model-based and model-free algorithms for online (reward-free or reward-driven) reinforcement learning in MDPs with low coverability.
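As background, one common formalization of coverability from this line of work, paraphrased here rather than quoted from the paper, is the best worst-case concentrability achievable by any sequence of data distributions:

$$C_{\mathrm{cov}} := \min_{\mu_{1:H} \in \Delta(\mathcal{S} \times \mathcal{A})} \; \sup_{\pi} \; \max_{h \in [H]} \; \left\| \frac{d_h^{\pi}}{\mu_h} \right\|_{\infty},$$

where $d_h^{\pi}$ is the state-action occupancy measure of $\pi$ at step $h$. Low coverability is the structural condition under which the $L_1$-Coverage objective supports efficient exploration.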
arXiv Detail & Related papers (2024-03-11T10:14:06Z)
- Bi-Level Offline Policy Optimization with Limited Exploration
We study offline reinforcement learning (RL), which seeks to learn a good policy based on a fixed, pre-collected dataset.
We propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level).
We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2023-10-10T02:45:50Z)
- When is Agnostic Reinforcement Learning Statistically Tractable? [76.1408672715773]
A new complexity measure, called the spanning capacity, depends solely on the set $\Pi$ and is independent of the MDP dynamics.
We show there exists a policy class $\Pi$ with a bounded spanning capacity that requires a superpolynomial number of samples to learn.
This reveals a surprising separation for learnability between generative access and online access models.
arXiv Detail & Related papers (2023-10-09T19:40:54Z)
- Provably Efficient Offline Goal-Conditioned Reinforcement Learning with General Function Approximation and Single-Policy Concentrability [11.786486763236104]
Goal-conditioned reinforcement learning (GCRL) refers to learning general-purpose skills that aim to reach diverse goals.
Offline GCRL requires only purely pre-collected datasets to perform training tasks.
We show that a modified offline GCRL algorithm is provably efficient with both general function approximation and single-policy concentrability.
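For reference, single-policy concentrability, paraphrased here from the broader offline RL literature rather than quoted from this paper, asks the offline data distribution $\mu$ to cover only a single comparator policy $\pi^{\star}$ rather than all policies:

$$C^{\pi^{\star}} := \sup_{s,a} \frac{d^{\pi^{\star}}(s,a)}{\mu(s,a)} < \infty.$$

This is substantially weaker than all-policy concentrability, which requires the analogous ratio to be bounded uniformly over the entire policy class.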
arXiv Detail & Related papers (2023-02-07T22:04:55Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
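To illustrate the LCB principle behind this kind of pessimistic policy learning, here is a minimal sketch for a finite policy class with logged bandit data. It is an illustration of the general idea only, not the paper's PPL algorithm: the function name, the inverse-propensity estimator, and the empirical-Bernstein-style penalty constants are all assumptions made for this example.

```python
import numpy as np

def select_policy_by_lcb(logged, policies, delta=0.05):
    """Pick the policy maximizing a lower confidence bound on its value.

    logged   : list of (x, a, r, p) tuples -- context, logged action,
               reward in [0, 1], and logging propensity p = mu(a | x).
    policies : finite list of deterministic policies, each mapping x -> a.

    Illustrative only: the penalty follows a Maurer-Pontil-style empirical
    Bernstein bound and ignores the range of the importance weights.
    """
    n = len(logged)
    best_pi, best_lcb = None, -np.inf
    for pi in policies:
        # Inverse-propensity-weighted value samples for this policy.
        w = np.array([r / p if pi(x) == a else 0.0 for (x, a, r, p) in logged])
        mean, var = w.mean(), w.var(ddof=1)
        # Variance-sensitive penalty: pessimism in the face of uncertainty.
        width = (np.sqrt(2.0 * var * np.log(2.0 / delta) / n)
                 + 7.0 * np.log(2.0 / delta) / (3.0 * (n - 1)))
        if mean - width > best_lcb:
            best_pi, best_lcb = pi, mean - width
    return best_pi, best_lcb
```

Maximizing the LCB rather than the point estimate means a policy is preferred only when the logged data actually supports its estimated value, which is the mechanism the abstract credits for relaxing the uniform overlap assumption.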
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation [74.3002974673248]
We consider the offline reinforcement learning problem, where the aim is to learn a decision making policy from logged data.
Offline RL is becoming increasingly relevant in practice, because online data collection is poorly suited to safety-critical domains.
Our results show that sample-efficient offline reinforcement learning requires either restrictive coverage conditions or representation conditions that go beyond supervised learning.
arXiv Detail & Related papers (2021-11-21T23:22:37Z)
- Provably Correct Optimization and Exploration with Non-linear Policies [65.60853260886516]
ENIAC is an actor-critic method that allows non-linear function approximation in the critic.
We show that under certain assumptions, the learner finds a near-optimal policy in $O(\mathrm{poly}(d))$ exploration rounds.
We empirically evaluate this adaptation and show that it outperforms prior heuristics inspired by linear methods.
arXiv Detail & Related papers (2021-03-22T03:16:33Z)