Posterior Sampling-based Online Learning for Episodic POMDPs
- URL: http://arxiv.org/abs/2310.10107v4
- Date: Wed, 23 Oct 2024 14:47:42 GMT
- Title: Posterior Sampling-based Online Learning for Episodic POMDPs
- Authors: Dengwang Tang, Dongze Ye, Rahul Jain, Ashutosh Nayyar, Pierluigi Nuzzo
- Abstract summary: We consider the online learning problem for episodic POMDPs with unknown transition and observation models.
We propose a Posterior Sampling-based reinforcement learning algorithm for POMDPs.
- Score: 5.797837329787459
- Abstract: Learning in POMDPs is known to be significantly harder than in MDPs. In this paper, we consider the online learning problem for episodic POMDPs with unknown transition and observation models. We propose a Posterior Sampling-based reinforcement learning algorithm for POMDPs (PS4POMDPs), which is much simpler and more implementable compared to state-of-the-art optimism-based online learning algorithms for POMDPs. We show that the Bayesian regret of the proposed algorithm scales as the square root of the number of episodes and is polynomial in the other parameters. In a general setting, the regret scales exponentially in the horizon length $H$, and we show that this is inevitable by providing a lower bound. However, when the POMDP is undercomplete and weakly revealing (a common assumption in the recent literature), we establish a polynomial Bayesian regret bound. We finally propose a posterior sampling algorithm for multi-agent POMDPs, and show it too has sublinear regret.
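The algorithm described in the abstract follows the standard posterior-sampling template: maintain a Bayesian posterior over the unknown transition and observation models, sample a model at the start of each episode, act optimally with respect to the sample, and update the posterior with the resulting trajectory. The sketch below illustrates that loop; the `prior`, `plan_optimal_policy`, and `run_episode` interfaces are hypothetical placeholders, not the paper's implementation of PS4POMDPs.

```python
# Minimal sketch of a posterior-sampling (PSRL-style) loop for episodic POMDPs.
# `prior`, `plan_optimal_policy`, and `run_episode` are hypothetical interfaces,
# not the PS4POMDPs implementation from the paper.

def posterior_sampling_pomdp(prior, plan_optimal_policy, run_episode, num_episodes):
    """Run `num_episodes` episodes of posterior sampling.

    prior               -- Bayesian belief over (transition, observation) models;
                           assumed to support .sample() and .update(trajectory)
    plan_optimal_policy -- planner returning an optimal policy for a sampled model
    run_episode         -- executes a policy in the true POMDP for one episode and
                           returns (trajectory, total_reward)
    """
    posterior = prior
    episode_rewards = []
    for _ in range(num_episodes):
        sampled_model = posterior.sample()           # draw a POMDP from the posterior
        policy = plan_optimal_policy(sampled_model)  # plan as if the sample were the truth
        trajectory, total_reward = run_episode(policy)
        posterior = posterior.update(trajectory)     # Bayes update with the new data
        episode_rewards.append(total_reward)
    return episode_rewards
```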
Related papers
- Near-Optimal Learning and Planning in Separated Latent MDPs [70.88315649628251]
We study computational and statistical aspects of learning Latent Markov Decision Processes (LMDPs).
In this model, the learner interacts with an MDP drawn at the beginning of each epoch from an unknown mixture of MDPs.
arXiv Detail & Related papers (2024-06-12T06:41:47Z) - Sample-Efficient Learning of POMDPs with Multiple Observations In Hindsight [105.6882315781987]
This paper studies the sample efficiency of learning in Partially Observable Markov Decision Processes (POMDPs).
Motivated by real-world settings such as loading in game playing, we propose an enhanced feedback model called "multiple observations in hindsight".
We show that sample-efficient learning is possible for two new subclasses of POMDPs: multi-observation revealing POMDPs and distinguishable POMDPs.
arXiv Detail & Related papers (2023-07-06T09:39:01Z) - Learning in POMDPs is Sample-Efficient with Hindsight Observability [36.66596305441365]
POMDPs capture a broad class of decision making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability.
In many realistic problems, more information is either revealed or can be computed at some point during the learning process.
We formulate such a setting as a POMDP where the latent states are revealed to the learner in hindsight, but only during training.
arXiv Detail & Related papers (2023-01-31T18:54:36Z) - Optimality Guarantees for Particle Belief Approximation of POMDPs [55.83001584645448]
Partially observable Markov decision processes (POMDPs) provide a flexible representation for real-world decision and control problems.
POMDPs are notoriously difficult to solve, especially when the state and observation spaces are continuous or hybrid.
We propose a theory characterizing the approximation error of the particle filtering techniques that practical sampling-based POMDP solvers use (see the particle-filter sketch after this list).
arXiv Detail & Related papers (2022-10-10T21:11:55Z) - Sublinear Regret for Learning POMDPs [5.675955495285045]
We study model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs).
We propose a learning algorithm for this problem, building on spectral method-of-moments estimations for hidden Markov models.
We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm, where $T$ is the learning horizon.
arXiv Detail & Related papers (2021-07-08T06:59:39Z) - A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs [117.82903457289584]
We derive a novel problem-dependent lower bound for the regret in finite-horizon Markov Decision Processes (MDPs).
We show that our lower bound is considerably smaller than in the general case and that it does not scale with the minimum action gap at all.
We show that this last result is attainable (up to $\mathrm{poly}(H)$ terms, where $H$ is the horizon) by providing a regret upper bound based on policy gaps for an optimistic algorithm.
arXiv Detail & Related papers (2021-06-24T13:46:09Z) - Online Learning for Unknown Partially Observable MDPs [11.458853556386797]
We consider infinite-horizon average-cost POMDPs with an unknown transition model but a known observation model.
We propose a natural posterior sampling-based reinforcement learning algorithm (POMDP-PSRL) and show that it achieves $O(T^{2/3})$ regret where $T$ is the time horizon.
arXiv Detail & Related papers (2021-02-25T03:36:13Z) - RL for Latent MDPs: Regret Guarantees and a Lower Bound [74.41782017817808]
We consider the regret problem for reinforcement learning in latent Markov Decision Processes (LMDP).
In an LMDP, an MDP is randomly drawn from a set of $M$ possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent.
We show that the key link is a notion of separation between the MDP system dynamics.
arXiv Detail & Related papers (2021-02-09T16:49:58Z) - Sample-Efficient Reinforcement Learning of Undercomplete POMDPs [91.40308354344505]
This work shows that these hardness barriers do not preclude efficient reinforcement learning for rich and interesting subclasses of Partially Observable Markov Decision Processes (POMDPs).
We present a sample-efficient algorithm, OOM-UCB, for episodic finite undercomplete POMDPs, where the number of observations is larger than the number of latent states and where exploration is essential for learning, thus distinguishing our results from prior works.
arXiv Detail & Related papers (2020-06-22T17:58:54Z)
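For the particle-belief paper listed above ("Optimality Guarantees for Particle Belief Approximation of POMDPs"), the object being analyzed is the belief maintained by a particle filter. A minimal bootstrap particle-filter belief update is sketched below; the `sample_transition` and `observation_likelihood` callables are illustrative assumptions, not that paper's API.

```python
# A minimal bootstrap particle-filter belief update of the kind analyzed in the
# particle-belief paper above; function names and the resampling scheme are
# illustrative assumptions, not that paper's code.
import numpy as np

def particle_belief_update(particles, action, observation,
                           sample_transition, observation_likelihood, rng):
    """One belief update: propagate particles, reweight by the observation
    likelihood, then resample to obtain an equally weighted particle set."""
    # Propagate each particle through the (sampled or known) transition model.
    propagated = np.array([sample_transition(s, action, rng) for s in particles])
    # Weight particles by how well they explain the received observation.
    weights = np.array([observation_likelihood(observation, s, action)
                        for s in propagated])
    weights = weights / weights.sum()  # assumes at least one particle has positive mass
    # Resample to keep the particle count fixed and the weights uniform.
    idx = rng.choice(len(propagated), size=len(propagated), p=weights)
    return propagated[idx]
```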