Learning in Markov Decision Processes under Constraints
- URL: http://arxiv.org/abs/2002.12435v5
- Date: Wed, 5 Jan 2022 19:21:00 GMT
- Title: Learning in Markov Decision Processes under Constraints
- Authors: Rahul Singh, Abhishek Gupta and Ness B. Shroff
- Abstract summary: We consider reinforcement learning in Markov Decision Processes in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process.
We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of $T$ time-steps.
We show how to reduce the regret of a desired subset of the $M$ costs, at the expense of increasing the regrets of rewards and the remaining costs.
- Score: 34.03325546283489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider reinforcement learning (RL) in Markov Decision Processes in which
an agent repeatedly interacts with an environment that is modeled by a
controlled Markov process. At each time step $t$, it earns a reward, and also
incurs a cost-vector consisting of $M$ costs. We design model-based RL
algorithms that maximize the cumulative reward earned over a time horizon of
$T$ time-steps, while simultaneously ensuring that the average values of the
$M$ cost expenditures are bounded by agent-specified thresholds
$c^{ub}_i,i=1,2,\ldots,M$.
In order to measure the performance of a reinforcement learning algorithm
that satisfies the average cost constraints, we define an $M+1$ dimensional
regret vector that is composed of its reward regret, and $M$ cost regrets. The
reward regret measures the sub-optimality in the cumulative reward, while the
$i$-th component of the cost regret vector is the difference between its $i$-th
cumulative cost expense and the expected cost expenditures $Tc^{ub}_i$.
We prove that the expected value of the regret vector of UCRL-CMDP is
upper-bounded as $\tilde{O}(T^{2/3})$, where $T$ is the time
horizon. We further show how to reduce the regret of a desired subset of the
$M$ costs, at the expense of increasing the regrets of rewards and the
remaining costs. To the best of our knowledge, ours is the only work that
considers non-episodic RL under average cost constraints and derives algorithms
that can \emph{tune the regret vector} according to the agent's requirements on
its cost regrets.
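To make the regret-vector definition above concrete, here is a minimal sketch (not code from the paper) that computes the $(M+1)$-dimensional regret vector from a logged trajectory. The function name and the argument `optimal_avg_reward` (the optimal constrained average reward, assumed known for evaluation purposes) are illustrative assumptions, not definitions from the paper.
```python
# A minimal sketch (not the paper's code) of the (M+1)-dimensional regret
# vector defined in the abstract: one reward regret followed by M cost regrets.
import numpy as np

def regret_vector(rewards, costs, optimal_avg_reward, cost_thresholds):
    """Compute the (M+1)-dimensional regret vector of a logged trajectory.

    rewards            : length-T array, rewards[t] is the reward earned at step t.
    costs              : T x M array, costs[t, i] is the i-th cost incurred at step t.
    optimal_avg_reward : average reward of the best stationary policy that meets
                         the average-cost constraints (assumed known here).
    cost_thresholds    : length-M array of agent-specified bounds c^{ub}_i.
    """
    rewards = np.asarray(rewards, dtype=float)
    costs = np.asarray(costs, dtype=float)
    thresholds = np.asarray(cost_thresholds, dtype=float)
    T = rewards.shape[0]

    # Reward regret: shortfall of the cumulative reward relative to T times
    # the optimal constrained average reward.
    reward_regret = T * optimal_avg_reward - rewards.sum()

    # Cost regrets: i-th cumulative cost expense minus its budget T * c^{ub}_i
    # (a signed difference, as in the abstract; no clipping at zero).
    cost_regrets = costs.sum(axis=0) - T * thresholds

    return np.concatenate(([reward_regret], cost_regrets))  # shape (M + 1,)
```
For instance, with $M=2$ costs and thresholds $(0.5, 1.0)$, `regret_vector(rewards, costs, rho_star, [0.5, 1.0])` returns the reward regret followed by the two cost regrets, matching the $(M+1)$-dimensional vector whose expected value the paper bounds by $\tilde{O}(T^{2/3})$.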
Related papers
- Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism [1.4999444543328293]
We propose a model-based algorithm based on novel cost and reward function estimators.
Our algorithm achieves a regret upper bound of $\widetilde{\mathcal{O}}((\bar C - \bar C_b)^{-1} H^{2.5} S\sqrt{AK})$, where $\bar C$ is the cost budget for an episode.
arXiv Detail & Related papers (2024-10-14T04:51:06Z) - Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Stochastic Shortest Path: Minimax, Parameter-Free and Towards
Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_\star \sqrt{SAK})$, where $K$ is the number of episodes, $S$ is the number of states, and $A$ is the number of actions.
arXiv Detail & Related papers (2021-04-22T17:20:48Z) - Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_\star \sqrt{|S| |A| K})$, where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z) - Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in
Multi-Agent RL and Inventory Control [28.80743320843154]
We propose Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), the first model-free algorithm for non-stationary RL.
We show that our algorithms are nearly optimal by establishing an information-theoretic lower bound of $\Omega(S^{1/3} A^{1/3} \Delta^{1/3} H^{2/3} T^{2/3})$, the first lower bound in non-stationary RL.
arXiv Detail & Related papers (2020-10-07T04:55:56Z) - Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost.
We show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S| |A| K})$ regret in the worst case.
arXiv Detail & Related papers (2020-02-23T09:10:14Z)