PAC Guarantees for Reinforcement Learning: Sample Complexity, Coverage, and Structure
- URL: http://arxiv.org/abs/2603.01309v1
- Date: Sun, 01 Mar 2026 22:33:16 GMT
- Title: PAC Guarantees for Reinforcement Learning: Sample Complexity, Coverage, and Structure
- Authors: Joshua Steier,
- Abstract summary: We propose the Coverage-Objective-Structure (CSO) framework, which decomposes nearly every PAC sample complexity result into three factors.<n> CSO is not a theorem but an interpretive template that identifies bottlenecks and makes cross-setting comparison immediate.<n>We provide practitioner tools: rate lookup tables indexed by CSO coordinates, Bellman residual diagnostics, coverage estimation with deployment gates, and per-episode policy certificates.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When data is scarce or mistakes are costly, average-case metrics fall short. What a practitioner needs is a guarantee: with probability at least $1-δ$, the learned policy is $\varepsilon$-close to optimal after $N$ episodes. This is the PAC promise, and between 2018 and 2025 the RL theory community made striking progress on when such promises can be kept. We survey that progress. Our organizing tool is the Coverage-Structure-Objective (CSO) framework, proposed here, which decomposes nearly every PAC sample complexity result into three factors: coverage (how data were obtained), structure (intrinsic MDP or function-class complexity), and objective (what the learner must deliver). CSO is not a theorem but an interpretive template that identifies bottlenecks and makes cross-setting comparison immediate. The technical core covers tight tabular baselines and the uniform-PAC bridge to regret; structural complexity measures (Bellman rank, witness rank, Bellman-Eluder dimension) governing learnability with function approximation; results for linear, kernel/NTK, and low-rank models; reward-free exploration as upfront coverage investment; and pessimistic offline RL where inherited coverage is the binding constraint. We provide practitioner tools: rate lookup tables indexed by CSO coordinates, Bellman residual diagnostics, coverage estimation with deployment gates, and per-episode policy certificates. A final section catalogs open problems, separating near-term targets from frontier questions where coverage, structure, and computation tangle in ways current theory cannot resolve.
Related papers
- CoVeR: Conformal Calibration for Versatile and Reliable Autoregressive Next-Token Prediction [49.09876340754804]
conformsctextCoVeR is a model-free decoding strategy that balances search efficiency with the need for versatile trajectories.<n>We show that conformsctextCoVeR simultaneously maintains a compact search space and ensures high coverage probability over desirable trajectories.
arXiv Detail & Related papers (2025-09-05T01:07:12Z) - What Fundamental Structure in Reward Functions Enables Efficient Sparse-Reward Learning? [6.908972852063454]
Policy-Aware Matrix Completion (PAMC) is a first concrete step toward a structural reward learning framework.<n>Our results highlight PAMC as a practical and principled tool when structural rewards exist, and as a concrete first instantiation of a broader structural reward learning perspective.
arXiv Detail & Related papers (2025-09-04T00:53:02Z) - Provably Sample-Efficient Robust Reinforcement Learning with Average Reward [4.530028899565083]
We propose a new algorithm designed for robust Markov Decision Processes (MDPs) with transition uncertainty characterized by $ell_p$-norm and contamination models.<n>Our algorithm operates without requiring any prior knowledge of the robust MDP.<n>Our work provides essential theoretical understanding of sample efficiency of robust average reward RL.
arXiv Detail & Related papers (2025-05-18T15:34:45Z) - When is Agnostic Reinforcement Learning Statistically Tractable? [76.1408672715773]
A new complexity measure, called the emphspanning capacity, depends solely on the set $Pi$ and is independent of the MDP dynamics.
We show there exists a policy class $Pi$ with a bounded spanning capacity that requires a superpolynomial number of samples to learn.
This reveals a surprising separation for learnability between generative access and online access models.
arXiv Detail & Related papers (2023-10-09T19:40:54Z) - Active Coverage for PAC Reinforcement Learning [24.256960622176305]
This paper formalizes the problem of active coverage in episodic Markov decision processes (MDPs)
We show that CovGame can be used as a building block to solve different PAC RL tasks.
arXiv Detail & Related papers (2023-06-23T16:39:37Z) - Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage [100.8180383245813]
We propose value-based algorithms for offline reinforcement learning (RL)
We show an analogous result for vanilla Q-functions under a soft margin condition.
Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
arXiv Detail & Related papers (2023-02-05T14:22:41Z) - PAC Verification of Statistical Algorithms [0.10878040851637999]
We develop the setting of PAC verification, where a hypothesis (machine learning model) that purportedly satisfies the agnostic PAC learning objective is verified using an interactive proof.
We present a protocol for PAC verification of unions of intervals over $mathbbR$ that improves upon their proposed protocol for that task, and matches our lower bound's dependence on $d$.
Our final result is a protocol for the verification of statistical algorithms that satisfy a constraint on their queries.
arXiv Detail & Related papers (2022-11-28T23:57:27Z) - Nearly Optimal Latent State Decoding in Block MDPs [74.51224067640717]
In episodic Block MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states.
We are first interested in estimating the latent state decoding function based on data generated under a fixed behavior policy.
We then study the problem of learning near-optimal policies in the reward-free framework.
arXiv Detail & Related papers (2022-08-17T18:49:53Z) - Is Vertical Logistic Regression Privacy-Preserving? A Comprehensive
Privacy Analysis and Beyond [57.10914865054868]
We consider vertical logistic regression (VLR) trained with mini-batch descent gradient.
We provide a comprehensive and rigorous privacy analysis of VLR in a class of open-source Federated Learning frameworks.
arXiv Detail & Related papers (2022-07-19T05:47:30Z) - The Fundamental Price of Secure Aggregation in Differentially Private
Federated Learning [34.630300910399036]
We characterize the fundamental communication cost required to obtain the best accuracy under $varepsilon$ central DP.
Our results show that $tildeOleft( min(n2varepsilon2, d) right)$ bits per client are both sufficient and necessary.
This provides a significant improvement relative to state-of-the-art SecAgg distributed DP schemes.
arXiv Detail & Related papers (2022-03-07T22:56:09Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present an provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.