Learning in Observable POMDPs, without Computationally Intractable
Oracles
- URL: http://arxiv.org/abs/2206.03446v1
- Date: Tue, 7 Jun 2022 17:05:27 GMT
- Title: Learning in Observable POMDPs, without Computationally Intractable
Oracles
- Authors: Noah Golowich, Ankur Moitra, Dhruv Rohatgi
- Abstract summary: We develop the first oracle-free learning algorithm for POMDPs under reasonable assumptions.
Specifically, we give a quasipolynomial-time end-to-end algorithm for learning in "observable" POMDPs, where observability is the assumption that well-separated distributions over states induce well-separated distributions over observations.
- Score: 23.636033995089587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Much of reinforcement learning theory is built on top of oracles that are
computationally hard to implement. Specifically for learning near-optimal
policies in Partially Observable Markov Decision Processes (POMDPs), existing
algorithms either need to make strong assumptions about the model dynamics
(e.g. deterministic transitions) or assume access to an oracle for solving a
hard optimistic planning or estimation problem as a subroutine. In this work we
develop the first oracle-free learning algorithm for POMDPs under reasonable
assumptions. Specifically, we give a quasipolynomial-time end-to-end algorithm
for learning in "observable" POMDPs, where observability is the assumption that
well-separated distributions over states induce well-separated distributions
over observations. Our techniques circumvent the more traditional approach of
using the principle of optimism under uncertainty to promote exploration, and
instead give a novel application of barycentric spanners to constructing policy
covers.
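For readers unfamiliar with the terminology, the observability assumption above is commonly formalized via the observation matrix: writing $\mathbb{O} \in \mathbb{R}^{O \times S}$ for the matrix whose $(o,s)$ entry is the probability of observing $o$ in state $s$, a POMDP is called $\gamma$-observable if for all distributions $b, b'$ over states
$$\|\mathbb{O}\, b - \mathbb{O}\, b'\|_1 \;\ge\; \gamma\, \|b - b'\|_1,$$
i.e. well-separated state distributions induce well-separated observation distributions. (This notation follows common usage in this line of work; see the paper for the exact definition and constants.)
The abstract's other key ingredient, barycentric spanners, also has a standard generic meaning: a $C$-approximate barycentric spanner of a set of vectors in $\mathbb{R}^d$ is a subset of $d$ of them such that every vector in the set is a linear combination of the subset with coefficients in $[-C, C]$. The sketch below implements the classical Awerbuch-Kleinberg swap procedure for a finite set of vectors; it is background illustration only, assumes numpy, and is not the paper's construction of policy covers (the function name and setup are our own).

import numpy as np

def approx_barycentric_spanner(points, C=2.0):
    # `points`: (n, d) array whose rows are assumed to span R^d.
    # Returns d rows of `points` forming a C-approximate barycentric spanner,
    # i.e. every row of `points` is a linear combination of the returned rows
    # with coefficients in [-C, C].  Generic swap procedure (Awerbuch-Kleinberg),
    # shown here only to illustrate the notion of a barycentric spanner.
    pts = np.asarray(points, dtype=float)
    n, d = pts.shape
    basis = np.eye(d)  # columns of this matrix are the current spanner

    def vol(i, x):
        # |det| of the current basis with column i replaced by candidate x
        trial = basis.copy()
        trial[:, i] = x
        return abs(np.linalg.det(trial))

    # Initialization: greedily replace each unit vector by the point that
    # maximizes the absolute determinant.
    for i in range(d):
        j = max(range(n), key=lambda k: vol(i, pts[k]))
        basis[:, i] = pts[j]

    # Swap phase: while some point enlarges |det| by more than a factor C
    # in some coordinate, swap it in.  Terminates for any C > 1.
    swapped = True
    while swapped:
        swapped = False
        for i in range(d):
            cur = abs(np.linalg.det(basis))
            for k in range(n):
                if vol(i, pts[k]) > C * cur:
                    basis[:, i] = pts[k]
                    swapped = True
                    cur = abs(np.linalg.det(basis))
    return basis.T  # rows are the spanner vectors

As a usage example, approx_barycentric_spanner(np.random.rand(100, 5)) returns 5 of the 100 input vectors such that every input vector is a combination of them with coefficients in $[-2, 2]$. Roughly speaking, in the setting of the paper the vectors would correspond to distributions induced by policies, so a spanner yields a small representative set of policies, i.e. a policy cover; see the paper for the actual construction.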
Related papers
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z) - Learning Logic Specifications for Policy Guidance in POMDPs: an
Inductive Logic Programming Approach [57.788675205519986]
We learn high-quality traces from POMDP executions generated by any solver.
We exploit data- and time-efficient Inductive Logic Programming (ILP) to generate interpretable belief-based policy specifications.
We show that the learned specifications, expressed in Answer Set Programming (ASP), yield performance superior to neural networks and similar to optimal handcrafted task-specific heuristics, in lower computational time.
arXiv Detail & Related papers (2024-02-29T15:36:01Z) - Risk-sensitive Markov Decision Process and Learning under General
Utility Functions [3.6260136172126667]
Reinforcement Learning (RL) has gained substantial attention across diverse application domains and theoretical investigations.
We propose a modified value iteration algorithm that employs an epsilon-covering over the space of cumulative reward.
In the absence of a simulator, our algorithm, designed with an upper-confidence-bound exploration approach, identifies a near-optimal policy.
arXiv Detail & Related papers (2023-11-22T18:50:06Z) - Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning [83.41487567765871]
Skipper is a model-based reinforcement learning framework.
It automatically decomposes the given task into smaller, more manageable subtasks.
It enables sparse decision-making and focused abstractions on the relevant parts of the environment.
arXiv Detail & Related papers (2023-09-30T02:25:18Z) - Provably Efficient UCB-type Algorithms For Learning Predictive State
Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, a last-iterate guarantee of a near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z) - Provably Efficient Representation Learning with Tractable Planning in
Low-Rank POMDP [81.00800920928621]
We study representation learning in partially observable Markov Decision Processes (POMDPs).
We first present an algorithm for decodable POMDPs that combines maximum likelihood estimation (MLE) and optimism in the face of uncertainty (OFU).
We then show how to adapt this algorithm to also work in the broader class of $\gamma$-observable POMDPs.
arXiv Detail & Related papers (2023-06-21T16:04:03Z) - Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual
Bandits [82.28442917447643]
We present the first general oracle-efficient algorithm for pessimistic OPO.
We obtain statistical guarantees analogous to those for prior pessimistic approaches.
We show advantage over unregularized OPO across a wide range of configurations.
arXiv Detail & Related papers (2023-06-13T17:29:50Z) - Planning in Observable POMDPs in Quasipolynomial Time [21.03037504572896]
We develop a quasipolynomial-time algorithm for planning in observable POMDPs.
We assume that well-separated distributions on states lead to well-separated distributions on observations.
We prove matching hardness for planning in observable POMDPs under the Exponential Time Hypothesis.
arXiv Detail & Related papers (2022-01-12T23:16:37Z) - Pessimistic Model-based Offline RL: PAC Bounds and Posterior Sampling
under Partial Coverage [33.766012922307084]
We study model-based offline Reinforcement Learning with general function approximation.
We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO), which leverages a general function class and uses a constraint to encode pessimism.
arXiv Detail & Related papers (2021-07-13T16:30:01Z) - Sublinear Regret for Learning POMDPs [5.675955495285045]
We study model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs).
We propose a learning algorithm for this problem, building on spectral method-of-moments estimations for hidden Markov models.
We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm, where $T$ is the learning horizon.
arXiv Detail & Related papers (2021-07-08T06:59:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.