RL for Latent MDPs: Regret Guarantees and a Lower Bound
- URL: http://arxiv.org/abs/2102.04939v1
- Date: Tue, 9 Feb 2021 16:49:58 GMT
- Title: RL for Latent MDPs: Regret Guarantees and a Lower Bound
- Authors: Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor
- Abstract summary: We consider the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDP).
In an LMDP, an MDP is randomly drawn from a set of $M$ possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent.
We show that the key link is a notion of separation between the MDP system dynamics.
- Score: 74.41782017817808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we consider the regret minimization problem for reinforcement
learning in latent Markov Decision Processes (LMDP). In an LMDP, an MDP is
randomly drawn from a set of $M$ possible MDPs at the beginning of the
interaction, but the identity of the chosen MDP is not revealed to the agent.
We first show that a general instance of LMDPs requires at least
$\Omega((SA)^M)$ episodes to even approximate the optimal policy. Then, we
consider sufficient assumptions under which learning good policies requires a
polynomial number of episodes. We show that the key link is a notion of
separation between the MDP system dynamics. With sufficient separation, we
provide an efficient algorithm with a local guarantee, i.e., a sublinear
regret guarantee when we are given a good initialization. Finally, if
we are given standard statistical sufficiency assumptions common in the
Predictive State Representation (PSR) literature (e.g., Boots et al.) and a
reachability assumption, we show that the need for initialization can be
removed.
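To make the interaction protocol concrete, here is a minimal Python sketch of an LMDP episode under hypothetical tabular inputs (the names `P`, `R`, `w`, `H`, and the `policy(s, h)` interface are illustrative, not from the paper): a latent MDP index is drawn from the mixing weights at the start of each episode, and the agent's policy never observes it.

```python
# Minimal illustrative LMDP interaction loop (a sketch, not the authors' code).
# Assumed tabular setup: M transition kernels P[m] of shape (S, A, S),
# reward tables R[m] of shape (S, A), mixing weights w over the M MDPs,
# and horizon H; all names are hypothetical.
import numpy as np

class LatentMDP:
    def __init__(self, P, R, w, H, rng=None):
        self.P, self.R, self.w, self.H = P, R, w, H
        self.rng = rng or np.random.default_rng()

    def run_episode(self, policy):
        """Run one episode; the latent index m is drawn but never revealed."""
        m = self.rng.choice(len(self.w), p=self.w)   # hidden context for this episode
        s, total_reward = 0, 0.0
        for h in range(self.H):
            a = policy(s, h)                         # the agent sees only (s, h), not m
            total_reward += self.R[m][s, a]
            s = self.rng.choice(self.P[m].shape[2], p=self.P[m][s, a])
        return total_reward                          # m stays hidden after the episode
```

The $\Omega((SA)^M)$ lower bound above reflects exactly this structure: because the agent never observes the hidden index, it must implicitly disambiguate among the $M$ candidate models from reward and transition feedback alone.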
Related papers
- Achieving $\tilde{O}(1/ε)$ Sample Complexity for Constrained Markov Decision Process [4.685121820790546]
We consider the reinforcement learning problem for the constrained Markov decision process (CMDP).
We derive a logarithmic regret bound, which translates into an $O(\frac{1}{\Delta \cdot \epsilon} \cdot \log^2(1/\epsilon))$ sample complexity bound.
Our algorithm operates in the primal space and we resolve the primal LP for the CMDP problem at each period in an online manner.
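For context, the primal LP of a CMDP is typically the occupancy-measure linear program; a standard discounted formulation (a sketch for illustration, not necessarily the exact program used in that paper, with assumed symbols $\mu$ for the initial distribution, $\gamma$ for the discount factor, $c$ for the constraint cost, and $\tau$ for the budget) is
\[
\max_{q \ge 0} \; \sum_{s,a} q(s,a)\, r(s,a)
\quad \text{s.t.} \quad
\sum_{a} q(s',a) = (1-\gamma)\,\mu(s') + \gamma \sum_{s,a} P(s' \mid s,a)\, q(s,a) \;\; \forall s',
\qquad
\sum_{s,a} q(s,a)\, c(s,a) \le \tau,
\]
where $q$ is the normalized state-action occupancy measure; re-solving such a program each period with updated empirical estimates is one way to realize an online primal-space method.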
arXiv Detail & Related papers (2024-02-26T06:08:25Z) - Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive Reinforcement Learning (RL) with the Conditional Value-at-Risk (CVaR) objective.
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance the interplay among exploration, exploitation, and representation learning in CVaR RL.
We prove that our algorithm achieves a polynomial sample complexity for learning an $\epsilon$-optimal CVaR policy, where $H$ is the length of each episode, $A$ is the cardinality of the action space, and $d$ is the dimension of the representations.
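For reference, the CVaR objective named in the title admits the standard variational form (conventions vary across papers; this is one common choice for a bounded reward random variable $X$ and risk level $\tau \in (0, 1]$):
\[
\mathrm{CVaR}_\tau(X) \;=\; \sup_{b \in \mathbb{R}} \Big\{\, b - \tfrac{1}{\tau}\, \mathbb{E}\big[(b - X)^{+}\big] \Big\},
\]
which recovers the risk-neutral objective $\mathbb{E}[X]$ at $\tau = 1$ and increasingly emphasizes the lower tail of the return as $\tau \to 0$.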
arXiv Detail & Related papers (2023-11-20T17:44:40Z) - Prospective Side Information for Latent MDPs [80.00842638151558]
We study the class of LMDPs with prospective side information, where the agent receives additional, weakly revealing, information at the beginning of each episode.
We show that, surprisingly, this problem is not captured by contemporary settings and algorithms designed for partially observed environments.
We then establish that any sample efficient algorithm must suffer at least $\Omega(K^{2/3})$ regret, as opposed to the standard $\Omega(\sqrt{K})$ lower bounds, and design an algorithm with a matching upper bound.
arXiv Detail & Related papers (2023-10-11T15:37:31Z) - A Theoretical Analysis of Optimistic Proximal Policy Optimization in
Linear Markov Decision Processes [13.466249082564213]
We propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback.
Compared with existing policy-based algorithms, we achieve the state-of-the-art regret bound in both linear MDPs and adversarial linear MDPs with full information.
arXiv Detail & Related papers (2023-05-15T17:55:24Z) - PAC Statistical Model Checking of Mean Payoff in Discrete- and
Continuous-Time MDP [0.34410212782758043]
We provide the first algorithm to compute mean payoff probably approximately correctly in unknown MDPs.
We do not require any knowledge of the state space, only a lower bound on the minimum transition probability.
In addition to providing probably approximately correct (PAC) bounds for our algorithm, we also demonstrate its practical nature by running experiments on standard benchmarks.
arXiv Detail & Related papers (2022-06-03T09:13:27Z) - Semi-Markov Offline Reinforcement Learning for Healthcare [57.15307499843254]
We introduce three offline RL algorithms, namely, SDQN, SDDQN, and SBCQ.
We experimentally demonstrate that only these algorithms learn the optimal policy in variable-time environments.
We apply our new algorithms to a real-world offline dataset pertaining to warfarin dosing for stroke prevention.
arXiv Detail & Related papers (2022-03-17T14:51:21Z) - Safe Exploration by Solving Early Terminated MDP [77.10563395197045]
We introduce a new approach to address safe RL problems under the framework of Early Terminated MDP (ET-MDP).
We first define the ET-MDP as an unconstrained MDP with the same optimal value function as its corresponding CMDP.
An off-policy algorithm based on context models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP with better performance and improved learning efficiency.
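To illustrate the early-termination idea, below is a minimal sketch of an episode wrapper (illustrative only, not the authors' implementation; the `env.step(action)` interface returning `(obs, reward, done, info)`, the per-step cost in `info["cost"]`, and the `budget` threshold are all assumptions).

```python
# Illustrative early-termination wrapper in the spirit of ET-MDP (a sketch,
# not the authors' code). Assumption: the wrapped environment reports a
# per-step constraint cost via info["cost"], and `budget` is a hypothetical
# threshold on the accumulated cost.
class EarlyTerminationWrapper:
    def __init__(self, env, budget):
        self.env = env
        self.budget = budget
        self.cum_cost = 0.0

    def reset(self):
        self.cum_cost = 0.0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.cum_cost += info.get("cost", 0.0)
        if self.cum_cost > self.budget:
            done = True   # end the episode early once the constraint budget is exceeded
        return obs, reward, done, info
```

Ending the episode as soon as the accumulated cost exceeds the budget is one natural way to obtain an unconstrained MDP whose episodes respect the constraint, which is the spirit of the construction summarized above.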
arXiv Detail & Related papers (2021-07-09T04:24:40Z) - A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs [117.82903457289584]
We derive a novel problem-dependent lower bound for the regret in finite-horizon Markov Decision Processes (MDPs).
We show that our lower bound is considerably smaller than in the general case and that it does not scale with the minimum action gap at all.
We show that this last result is attainable (up to $\mathrm{poly}(H)$ terms, where $H$ is the horizon) by providing a regret upper bound based on policy gaps for an optimistic algorithm.
arXiv Detail & Related papers (2021-06-24T13:46:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.