Polynomial-Time Approximability of Constrained Reinforcement Learning
- URL: http://arxiv.org/abs/2502.07764v1
- Date: Tue, 11 Feb 2025 18:47:53 GMT
- Title: Polynomial-Time Approximability of Constrained Reinforcement Learning
- Authors: Jeremy McMahan,
- Abstract summary: We study the computational complexity of approximating general constrained Markov decision processes.
Our contribution is the design of a time $(0,epsilon)$-additive approximation for finding optimal constrained policies.
- Score: 1.223779595809275
- License:
- Abstract: We study the computational complexity of approximating general constrained Markov decision processes. Our primary contribution is the design of a polynomial time $(0,\epsilon)$-additive bicriteria approximation algorithm for finding optimal constrained policies across a broad class of recursively computable constraints, including almost-sure, chance, expectation, and their anytime variants. Matching lower bounds imply our approximation guarantees are optimal so long as $P \neq NP$. The generality of our approach results in answers to several long-standing open complexity questions in the constrained reinforcement learning literature. Specifically, we are the first to prove polynomial-time approximability for the following settings: policies under chance constraints, deterministic policies under multiple expectation constraints, policies under non-homogeneous constraints (i.e., constraints of different types), and policies under constraints for continuous-state processes.
Related papers
- Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action [10.219627570276689]
We develop a framework for a class of Markov Decision Processes with general state and spaces.
We show that gradient methods converge to the globally optimal policy with a nonasymptomatic condition.
Our result establishes first complexity for multi-period inventory systems.
arXiv Detail & Related papers (2024-09-25T17:56:02Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-ite convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Deterministic Policies for Constrained Reinforcement Learning in Polynomial Time [1.223779595809275]
Our algorithm efficiently computes near-optimal deterministic policies for constrained reinforcement learning problems.
Our work answers three open questions spanning two long-standing lines of research.
arXiv Detail & Related papers (2024-05-23T05:27:51Z) - Learning Adversarial MDPs with Stochastic Hard Constraints [37.24692425018]
We study online learning in constrained Markov decision processes (CMDPs) with adversarial losses and hard constraints.
Our work is the first to study CMDPs involving both adversarial losses and hard constraints.
arXiv Detail & Related papers (2024-03-06T12:49:08Z) - Towards Instance-Optimality in Online PAC Reinforcement Learning [28.156332484814616]
We propose the first instance-dependent lower bound on the sample complexity required for the PAC identification of a near-optimal policy.
We demonstrate that the sample complexity of the PEDEL algorithm of citeWagenmaker22linearMDP closely approaches this lower bound.
arXiv Detail & Related papers (2023-10-31T19:26:36Z) - High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
arXiv Detail & Related papers (2023-05-30T12:58:39Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimize lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - Nearly Optimal Latent State Decoding in Block MDPs [74.51224067640717]
In episodic Block MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states.
We are first interested in estimating the latent state decoding function based on data generated under a fixed behavior policy.
We then study the problem of learning near-optimal policies in the reward-free framework.
arXiv Detail & Related papers (2022-08-17T18:49:53Z) - Provable Reinforcement Learning with a Short-Term Memory [68.00677878812908]
We study a new subclass of POMDPs, whose latent states can be decoded by the most recent history of a short length $m$.
In particular, in the rich-observation setting, we develop new algorithms using a novel "moment matching" approach with a sample complexity that scales exponentially.
Our results show that a short-term memory suffices for reinforcement learning in these environments.
arXiv Detail & Related papers (2022-02-08T16:39:57Z) - The Geometry of Memoryless Stochastic Policy Optimization in
Infinite-Horizon POMDPs [0.0]
We consider the problem of finding the best memoryless policy for an infinite-horizon partially observable decision process.
We show that the discounted state-action frequencies and the expected cumulative reward are the functions of the policy, whereby the degree is determined by the degree of partial observability.
arXiv Detail & Related papers (2021-10-14T14:42:09Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence
Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first-time analysis of SRL algorithms with global optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.