Sublinear Regret for Learning POMDPs
- URL: http://arxiv.org/abs/2107.03635v1
- Date: Thu, 8 Jul 2021 06:59:39 GMT
- Title: Sublinear Regret for Learning POMDPs
- Authors: Yi Xiong, Ningyuan Chen, Xuefeng Gao, Xiang Zhou
- Abstract summary: We study model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs).
We propose a learning algorithm for this problem, building on spectral method-of-moments estimations for hidden Markov models.
We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm where $T$ is the learning horizon.
- Score: 5.675955495285045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the model-based undiscounted reinforcement learning for partially
observable Markov decision processes (POMDPs). The oracle we consider is the
optimal policy of the POMDP with a known environment in terms of the average
reward over an infinite horizon. We propose a learning algorithm for this
problem, building on spectral method-of-moments estimations for hidden Markov
models, the belief error control in POMDPs and upper-confidence-bound methods
for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$
for the proposed learning algorithm where $T$ is the learning horizon. This is,
to the best of our knowledge, the first algorithm achieving sublinear regret
with respect to our oracle for learning general POMDPs.
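As context for the ingredients named in the abstract, here is a minimal sketch of the standard POMDP belief update (Bayes filter), the object whose estimation error a belief-error-control argument has to manage. The transition and observation matrices below are made-up placeholders, not quantities from the paper, and the code is illustrative rather than the authors' algorithm.

```python
import numpy as np

# Generic POMDP belief update (Bayes filter) for a fixed action.
# All matrices are illustrative placeholders, not from the paper.
T = np.array([[0.9, 0.1],    # T[s, s'] = P(s' | s, a)
              [0.2, 0.8]])
O = np.array([[0.7, 0.3],    # O[s', o] = P(o | s')
              [0.1, 0.9]])

def belief_update(b, obs):
    """Predict with T, correct with the likelihood of obs, renormalize."""
    predicted = b @ T                     # distribution over next states
    unnormalized = predicted * O[:, obs]  # weight by observation likelihood
    return unnormalized / unnormalized.sum()

b = np.array([0.5, 0.5])                  # uniform prior over the two states
for obs in [0, 1, 1]:                     # an arbitrary observation sequence
    b = belief_update(b, obs)
    print(b)
```

When the true transition and observation matrices are replaced by estimates (e.g., from the spectral method-of-moments step), errors in those estimates propagate through this recursion; roughly speaking, bounding that propagation is the role of the belief error control mentioned in the abstract.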
Related papers
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z) - Imitation Learning in Discounted Linear MDPs without exploration assumptions [58.81226849657474]
We present a new algorithm, dubbed ILARL, for imitation learning in infinite-horizon linear MDPs.
We improve the dependence on the desired accuracy $\epsilon$ from $\mathcal{O}(\epsilon^{-5})$ to $\mathcal{O}(\epsilon^{-4})$.
Numerical experiments with linear function approximation show that ILARL outperforms other commonly used algorithms.
arXiv Detail & Related papers (2024-05-03T15:28:44Z) - Truly No-Regret Learning in Constrained MDPs [61.78619476991494]
We propose a model-based primal-dual algorithm to learn in an unknown CMDP.
We prove that our algorithm achieves sublinear regret without error cancellations.
arXiv Detail & Related papers (2024-02-24T09:47:46Z) - Posterior Sampling-based Online Learning for Episodic POMDPs [5.797837329787459]
We consider the online learning problem for episodic POMDPs with unknown transition and observation models.
We propose a Posterior Sampling-based reinforcement learning algorithm for POMDPs.
arXiv Detail & Related papers (2023-10-16T06:41:13Z) - On Reward Structures of Markov Decision Processes [4.13365552362244]
A Markov decision process can be parameterized by a transition kernel and a reward function.
We study various kinds of "costs" associated with reinforcement learning inspired by the demands in robotic applications.
We develop a novel estimator with an instance-specific error bound of $\tilde{O}(\sqrt{\frac{\tau_s}{n}})$ for estimating a single state value.
arXiv Detail & Related papers (2023-08-28T22:29:16Z) - Improved Regret for Efficient Online Reinforcement Learning with Linear
Function Approximation [69.0695698566235]
We study reinforcement learning with linear function approximation and adversarially changing cost functions.
We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback.
arXiv Detail & Related papers (2023-01-30T17:26:39Z) - Learning in Observable POMDPs, without Computationally Intractable
Oracles [23.636033995089587]
We develop the first oracle-free learning algorithm for POMDPs under reasonable assumptions.
Specifically, we give a quasipolynomial-time end-to-end algorithm for learning in "observable" POMDPs, where observability is the assumption that well-separated distributions over states induce well-separated distributions over observations (see the sketch after this list).
arXiv Detail & Related papers (2022-06-07T17:05:27Z) - Online Learning for Unknown Partially Observable MDPs [11.458853556386797]
We consider infinite-horizon average-cost POMDPs with an unknown transition model but a known observation model.
We propose a natural posterior sampling-based reinforcement learning algorithm (POMDP-PSRL) and show that it achieves $O(T^{2/3})$ regret where $T$ is the time horizon.
arXiv Detail & Related papers (2021-02-25T03:36:13Z) - Online Apprenticeship Learning [58.45089581278177]
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function.
The goal is to find a policy that matches the expert's performance on some predefined set of cost functions.
We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms.
arXiv Detail & Related papers (2021-02-13T12:57:51Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
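As a complementary illustration of the observability assumption quoted in the "Learning in Observable POMDPs" entry above, the sketch below empirically probes how separated state distributions map to separated observation distributions. The observation matrix, the Dirichlet sampling, and the empirical-minimum estimate are all assumptions made for illustration, not material from that paper.

```python
import numpy as np

# Probe the quoted observability notion: well-separated distributions over
# states should induce well-separated distributions over observations.
# The matrix below is an illustrative placeholder with well-separated rows.
rng = np.random.default_rng(0)
O = np.array([[0.80, 0.15, 0.05],   # O[s, o] = P(o | s)
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])

def l1(x, y):
    return np.abs(x - y).sum()

# Ratio of observation-distribution separation to state-distribution
# separation, minimized over randomly sampled pairs of beliefs.
ratios = []
for _ in range(10_000):
    b1, b2 = rng.dirichlet(np.ones(3), size=2)
    ratios.append(l1(b1 @ O, b2 @ O) / l1(b1, b2))
print(f"empirical separation ratio over sampled belief pairs: {min(ratios):.3f}")
```

Because O is row-stochastic, this ratio never exceeds 1; a value bounded away from 0 is the qualitative signature of an "observable" POMDP in the sense quoted above.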
This list is automatically generated from the titles and abstracts of the papers on this site.