Anytime-Competitive Reinforcement Learning with Policy Prior
- URL: http://arxiv.org/abs/2311.01568v3
- Date: Fri, 2 Feb 2024 20:58:42 GMT
- Title: Anytime-Competitive Reinforcement Learning with Policy Prior
- Authors: Jianyi Yang, Pengfei Li, Tongxin Li, Adam Wierman, Shaolei Ren
- Abstract summary: The goal of A-CMDP is to optimize the expected reward while guaranteeing a bounded cost in each round of any episode against a policy prior.
We propose a new algorithm, called Anytime-Competitive Reinforcement Learning (ACRL), which provably guarantees the anytime cost constraints.
- Score: 41.45104303955067
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the problem of Anytime-Competitive Markov Decision Process
(A-CMDP). Existing works on Constrained Markov Decision Processes (CMDPs) aim
to optimize the expected reward while constraining the expected cost over
random dynamics, but the cost in a specific episode can still be
unsatisfactorily high. In contrast, the goal of A-CMDP is to optimize the
expected reward while guaranteeing a bounded cost in each round of any episode
against a policy prior. We propose a new algorithm, called Anytime-Competitive
Reinforcement Learning (ACRL), which provably guarantees the anytime cost
constraints. The regret analysis shows that the learned policy asymptotically matches the
optimal reward achievable under the anytime competitive constraints.
Experiments on the application of carbon-intelligent computing verify the
reward performance and cost constraint guarantee of ACRL.
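As a rough formalization of the constraint described above, the anytime competitive requirement can be written as follows; the notation here ($\pi^{\dagger}$ for the policy prior, $\lambda$ and $b$ for relaxation parameters) is an assumption and may differ from the paper's.

```latex
% Sketch of the A-CMDP objective: maximize expected reward subject to an
% anytime competitiveness constraint against the policy prior \pi^{\dagger}.
% \lambda \ge 0 and b \ge 0 are assumed relaxation parameters.
\max_{\pi}\; \mathbb{E}\!\left[\sum_{h=1}^{H} r_h(s_h, a_h)\right]
\quad \text{s.t.} \quad
\sum_{h=1}^{t} c_h(s_h, a_h)
\;\le\; (1+\lambda) \sum_{h=1}^{t} c_h\!\left(s_h^{\dagger}, a_h^{\dagger}\right) + t\,b
\qquad \forall\, t \in \{1, \dots, H\}, \text{ in every episode.}
```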
Related papers
- Anytime Incremental $ρ$POMDP Planning in Continuous Spaces [5.767643556541711]
We present an anytime solver that dynamically refines belief representations, with formal guarantees of improvement over time.
We demonstrate its effectiveness for common entropy estimators, reducing computational cost by orders of magnitude.
Experimental results show that $ρ$POMCPOW outperforms state-of-the-art solvers in both efficiency and solution quality.
arXiv Detail & Related papers (2025-02-04T18:19:40Z)
- Learning Constrained Markov Decision Processes With Non-stationary Rewards and Constraints [34.7178680288326]
In constrained Markov decision processes (CMDPs) with adversarial rewards and constraints, a well-known impossibility result prevents any algorithm from attaining sublinear regret and sublinear constraint violation.
We show that this negative result can be eased in CMDPs with non-stationary rewards and constraints, by providing algorithms whose performances smoothly degrade as non-stationarity increases.
arXiv Detail & Related papers (2024-05-23T09:48:48Z)
- Anytime-Constrained Reinforcement Learning [6.981971551979697]
We introduce and study constrained Markov Decision Processes (cMDPs) with anytime constraints.
We show that optimal deterministic policies exist once the state is augmented with the cumulative cost.
We show that computing non-trivial approximately optimal policies is NP-hard in general.
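To make the cost-augmentation idea concrete, here is a minimal Python sketch (the interfaces `transition_fn`, `cost_fn`, and the scalar budget are illustrative assumptions, not the paper's code): a deterministic policy defined on states augmented with the running cost can screen out actions that would immediately break an anytime budget.

```python
# Minimal sketch of cost-augmented states for anytime constraints.
# All names below are illustrative assumptions, not the paper's API.
from dataclasses import dataclass

@dataclass(frozen=True)
class AugmentedState:
    state: int       # original MDP state
    cum_cost: float  # cost accumulated so far in the episode

def safe_actions(aug, actions, cost_fn, budget):
    """Conservative screen: keep actions whose immediate cost already fits
    the anytime budget (a full solution would also bound the cost-to-go)."""
    return [a for a in actions if aug.cum_cost + cost_fn(aug.state, a) <= budget]

def step_augmented(aug, action, transition_fn, cost_fn):
    """Advance the augmented state: the cumulative cost is part of the state,
    so a deterministic Markov policy on AugmentedState suffices."""
    nxt = transition_fn(aug.state, action)
    return AugmentedState(nxt, aug.cum_cost + cost_fn(aug.state, action))
```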
arXiv Detail & Related papers (2023-11-09T16:51:26Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
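A minimal single-loop sketch under assumed tabular dynamics: each iteration interleaves one soft Bellman backup (the policy step) with one gradient step on the reward parameters, rather than nesting a full inner RL solve; the paper's actual updates may differ.

```python
import numpy as np

def single_loop_irl(P, expert_sa_freq, gamma=0.9, lr=0.1, iters=500):
    """Single-loop max-likelihood IRL sketch (an assumed simplification, not
    the paper's exact updates). P: (S, A, S) transition tensor;
    expert_sa_freq: (S, A) empirical expert state-action visitation frequencies."""
    S, A = expert_sa_freq.shape
    r = np.zeros((S, A))                     # reward estimate
    V = np.zeros(S)                          # soft value function
    for _ in range(iters):
        Q = r + gamma * (P @ V)              # one soft backup, shape (S, A)
        Qmax = Q.max(axis=1, keepdims=True)  # stabilized log-sum-exp
        V = Qmax[:, 0] + np.log(np.exp(Q - Qmax).sum(axis=1))
        pi = np.exp(Q - V[:, None])          # soft-greedy policy
        # Likelihood-gradient proxy: expert visitation minus the visitation the
        # current policy would induce on the expert's state marginal.
        model_sa = expert_sa_freq.sum(axis=1, keepdims=True) * pi
        r += lr * (expert_sa_freq - model_sa)
    return r, pi
```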
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- A Unifying Framework for Online Optimization with Long-Term Constraints [62.35194099438855]
We study online learning problems in which a decision maker has to take a sequence of decisions subject to $m$ long-term constraints.
The goal is to maximize their total reward, while at the same time achieving small cumulative violation across the $T$ rounds.
We present the first best-of-both-worlds type algorithm for this general class of problems, with no-regret guarantees both when rewards and constraints are selected according to an unknown model and when they are selected adversarially at each round.
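The summary does not spell out the algorithm; as a reference point, the classical primal-dual template for long-term constraints looks roughly like the sketch below (exponential weights on a Lagrangian payoff plus projected dual ascent on violations; `reward_fn`, `constraint_fn`, and the step size `eta` are assumptions, not the paper's method).

```python
import numpy as np

def primal_dual_online(rounds, n_actions, m, reward_fn, constraint_fn, eta=0.05):
    """Primal-dual sketch for online optimization with m long-term constraints.
    reward_fn(t) returns an (n_actions,) reward vector; constraint_fn(t) returns
    an (m, n_actions) matrix where entries <= 0 mean 'feasible'. Both callbacks
    are illustrative assumptions."""
    w = np.ones(n_actions)              # primal: exponential weights over actions
    lam = np.zeros(m)                   # dual variables, one per constraint
    total_reward = 0.0
    total_violation = np.zeros(m)
    for t in range(rounds):
        p = w / w.sum()
        a = np.random.choice(n_actions, p=p)
        r = reward_fn(t)
        g = constraint_fn(t)
        payoff = r - lam @ g            # per-action Lagrangian payoff
        w *= np.exp(eta * payoff)       # exponential-weights primal update
        w /= w.max()                    # rescale for numerical stability
        total_reward += r[a]
        total_violation += np.maximum(g[:, a], 0.0)
        lam = np.maximum(lam + eta * g[:, a], 0.0)  # projected dual ascent
    return total_reward, total_violation
```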
arXiv Detail & Related papers (2022-09-15T16:59:19Z)
- COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation [73.17078343706909]
We consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of the stationary distribution.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
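COptiDICE itself estimates stationary distribution corrections from offline data; as a reference for the object being optimized, the tabular full-information version is the classical occupancy-measure LP sketched below (not the paper's offline estimator).

```python
import numpy as np
from scipy.optimize import linprog

def constrained_mdp_lp(P, r, c, mu0, gamma, cost_limit):
    """Tabular occupancy-measure LP: maximize sum_{s,a} d(s,a) r(s,a) subject
    to Bellman flow constraints and sum_{s,a} d(s,a) c(s,a) <= cost_limit.
    COptiDICE targets a regularized offline version of this via stationary
    distribution corrections; this LP is only a reference sketch.
    P: (S, A, S) transitions; r, c: (S, A); mu0: (S,) initial distribution."""
    S, A = r.shape
    # Flow: sum_a d(s,a) - gamma * sum_{s',a'} P(s|s',a') d(s',a')
    #       = (1 - gamma) * mu0(s)  for every state s.
    A_eq = (np.kron(np.eye(S), np.ones(A))
            - gamma * P.transpose(2, 0, 1).reshape(S, S * A))
    b_eq = (1.0 - gamma) * mu0
    res = linprog(c=-r.ravel(),                      # linprog minimizes
                  A_ub=c.ravel()[None, :], b_ub=[cost_limit],
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    d = res.x.reshape(S, A)                          # optimal occupancy measure
    pi = d / np.maximum(d.sum(axis=1, keepdims=True), 1e-12)
    return pi, d
```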
arXiv Detail & Related papers (2022-04-19T15:55:47Z)
- A Lyapunov-Based Methodology for Constrained Optimization with Bandit Feedback [22.17503016665027]
We consider problems where each action returns a random reward, cost, and penalty from an unknown joint distribution.
We propose a novel low-complexity algorithm, named $\tt LyOn$, and prove that it achieves $O(\sqrt{B}\log B)$ regret and $O(\log B/B)$ constraint violation.
The low computational cost and sharp performance bounds of $\tt LyOn$ suggest that the Lyapunov-based algorithm design methodology can be effective in solving constrained bandit optimization problems.
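The summary does not give $\tt LyOn$'s update rule; a generic Lyapunov drift-plus-penalty selection rule for a constrained bandit looks like the sketch below (the virtual queue, the pacing budget `rho`, and the trade-off weight `V` are standard ingredients of this methodology, not necessarily $\tt LyOn$'s exact design).

```python
import numpy as np

def drift_plus_penalty_bandit(T, rho, n_arms, arm_sampler, V=10.0):
    """Lyapunov drift-plus-penalty sketch for a constrained bandit (a generic
    instance of the methodology, not necessarily LyOn's exact rule).
    rho: per-round cost budget; V: reward-vs-queue-stability trade-off;
    arm_sampler(a) -> (reward, cost) samples the unknown joint distribution."""
    Q = 0.0                                   # virtual queue for budget violation
    reward_sum = np.zeros(n_arms)
    cost_sum = np.zeros(n_arms)
    pulls = np.zeros(n_arms)
    total_reward = 0.0
    for t in range(T):
        if t < n_arms:
            a = t                             # pull each arm once to initialize
        else:
            r_hat = reward_sum / pulls        # empirical means
            c_hat = cost_sum / pulls
            a = int(np.argmin(Q * c_hat - V * r_hat))  # drift-plus-penalty rule
        reward, cost = arm_sampler(a)
        reward_sum[a] += reward
        cost_sum[a] += cost
        pulls[a] += 1
        total_reward += reward
        Q = max(Q + cost - rho, 0.0)          # queue grows when over budget
    return total_reward, Q
```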
arXiv Detail & Related papers (2021-06-09T16:12:07Z)
- CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms with guarantees of convergence to globally optimal policies.
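A CRPO-style primal update can be sketched as follows (an assumed simplification with oracle gradient callbacks, not the paper's exact algorithm): take a reward-improvement step when all estimated costs are within their limits plus a tolerance, and otherwise descend on the most violated constraint.

```python
import numpy as np

def crpo_step(theta, grad_reward, grad_costs, cost_estimates, limits,
              tolerance=0.05, lr=0.01):
    """CRPO-flavored primal update (assumed simplification): grad_reward and
    the entries of grad_costs are assumed gradient oracles for parameters theta;
    cost_estimates and limits are per-constraint arrays."""
    violations = np.asarray(cost_estimates) - np.asarray(limits)
    if np.all(violations <= tolerance):
        return theta + lr * grad_reward(theta)     # reward-improvement step
    i = int(np.argmax(violations))                 # most violated constraint
    return theta - lr * grad_costs[i](theta)       # constraint-reduction step
```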
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
- Hierarchical Adaptive Contextual Bandits for Resource Constraint based Recommendation [49.69139684065241]
Contextual multi-armed bandits (MAB) achieve cutting-edge performance on a variety of problems.
In this paper, we propose a hierarchical adaptive contextual bandit method (HATCH) to conduct the policy learning of contextual bandits with a budget constraint.
arXiv Detail & Related papers (2020-04-02T17:04:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.