Partially Observable RL with B-Stability: Unified Structural Condition
and Sharp Sample-Efficient Algorithms
- URL: http://arxiv.org/abs/2209.14990v1
- Date: Thu, 29 Sep 2022 17:51:51 GMT
- Title: Partially Observable RL with B-Stability: Unified Structural Condition
and Sharp Sample-Efficient Algorithms
- Authors: Fan Chen, Yu Bai, Song Mei
- Abstract summary: This paper advances all three aspects of Partially Observable RL in the general setting of Predictive State Representations (PSRs).
We propose a natural and unified structural condition for PSRs called \emph{B-stability}. B-stable PSRs encompass the vast majority of known tractable subclasses.
We show that any B-stable PSR can be learned with polynomial samples in relevant problem parameters. When instantiated in the aforementioned subclasses, our sample complexities improve substantially.
- Score: 25.658930892561735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Partial Observability -- where agents can only observe partial information
about the true underlying state of the system -- is ubiquitous in real-world
applications of Reinforcement Learning (RL). Theoretically, learning a
near-optimal policy under partial observability is known to be hard in the
worst case due to an exponential sample complexity lower bound. Recent work has
identified several tractable subclasses that are learnable with polynomial
samples, such as Partially Observable Markov Decision Processes (POMDPs) with
certain revealing or decodability conditions. However, this line of research is
still in its infancy, where (1) unified structural conditions enabling
sample-efficient learning are lacking; (2) existing sample complexities for
known tractable subclasses are far from sharp; and (3) fewer sample-efficient
algorithms are available than in fully observable RL.
This paper advances all three aspects above for Partially Observable RL in
the general setting of Predictive State Representations (PSRs). First, we
propose a natural and unified structural condition for PSRs called
\emph{B-stability}. B-stable PSRs encompass the vast majority of known
tractable subclasses such as weakly revealing POMDPs, low-rank
future-sufficient POMDPs, decodable POMDPs, and regular PSRs. Next, we show
that any B-stable PSR can be learned with polynomial samples in relevant
problem parameters. When instantiated in the aforementioned subclasses, our
sample complexities improve substantially over the current best ones. Finally,
our results are achieved by three algorithms simultaneously: Optimistic Maximum
Likelihood Estimation, Estimation-to-Decisions, and Model-Based Optimistic
Posterior Sampling. The latter two algorithms are new for sample-efficient
learning of POMDPs/PSRs.
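As a rough illustration of the Optimistic Maximum Likelihood Estimation (OMLE) idea named above, the following is a minimal sketch on a toy two-action instance. The candidate model class, the bandit-style reward structure, and the confidence radius `beta` are all illustrative assumptions for this sketch, not the paper's actual PSR setting or algorithm.

```python
import math
import random

random.seed(0)

# Hypothetical toy instance: each "model" assigns a Bernoulli reward
# probability to each of two actions; the true model is among the candidates.
CANDIDATES = [
    {0: 0.9, 1: 0.1},
    {0: 0.1, 1: 0.9},
    {0: 0.5, 1: 0.5},
]
TRUE_MODEL = CANDIDATES[1]

def log_likelihood(model, data):
    """Total log-likelihood of observed (action, reward) pairs under a model."""
    total = 0.0
    for action, reward in data:
        p = model[action] if reward == 1 else 1.0 - model[action]
        total += math.log(max(p, 1e-12))
    return total

def omle(rounds=300, beta=3.0):
    """Optimistic MLE loop: keep models near max likelihood, act optimistically."""
    data = []
    for _ in range(rounds):
        lls = [log_likelihood(m, data) for m in CANDIDATES]
        best = max(lls)
        # Confidence set: models whose log-likelihood is within beta of the max.
        confidence_set = [m for m, ll in zip(CANDIDATES, lls) if ll >= best - beta]
        # Optimism: among models still in the confidence set, pick the one
        # promising the highest reward, then play its greedy action.
        model = max(confidence_set, key=lambda m: max(m.values()))
        action = max(model, key=model.get)
        reward = 1 if random.random() < TRUE_MODEL[action] else 0
        data.append((action, reward))
    return data

history = omle()
late_optimal = sum(1 for action, _ in history[-50:] if action == 1)
print(late_optimal)
```

In this sketch, models that explain the data poorly drop out of the confidence set, so optimistic play converges to the optimal action; the general algorithm replaces the finite candidate list with a rich model class and the greedy action with a planned policy.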
Related papers
- Tractable Offline Learning of Regular Decision Processes [50.11277112628193]
This work studies offline Reinforcement Learning (RL) in a class of non-Markovian environments called Regular Decision Processes (RDPs).
In RDPs, the unknown dependency of future observations and rewards on past interactions can be captured by a finite-state automaton.
Many algorithms first reconstruct this unknown dependency using automata learning techniques.
arXiv Detail & Related papers (2024-09-04T14:26:58Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - Provably Efficient UCB-type Algorithms For Learning Predictive State
Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, a last-iterate guarantee of a near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z) - GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP,
and Beyond [101.5329678997916]
We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making.
We propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation.
We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR.
arXiv Detail & Related papers (2022-11-03T16:42:40Z) - Optimistic MLE -- A Generic Model-based Algorithm for Partially
Observable Sequential Decision Making [48.87943416098096]
This paper introduces a simple and efficient learning algorithm, Optimistic Maximum Likelihood Estimation (OMLE), for general sequential decision making.
We prove that OMLE learns near-optimal policies for an enormously rich class of sequential decision making problems.
arXiv Detail & Related papers (2022-09-29T17:56:25Z) - PAC Reinforcement Learning for Predictive State Representations [60.00237613646686]
We study online Reinforcement Learning (RL) in partially observable dynamical systems.
We focus on the Predictive State Representations (PSRs) model, which is an expressive model that captures other well-known models.
We develop a novel model-based algorithm for PSRs that can learn a near-optimal policy with sample complexity scaling polynomially in the relevant problem parameters.
arXiv Detail & Related papers (2022-07-12T17:57:17Z) - When Is Partially Observable Reinforcement Learning Not Scary? [30.754810416907123]
While learning partially observable Markov decision processes (POMDPs) requires an exponential number of samples in the worst case, we identify a rich subclass that is learnable with a polynomial number of samples.
This is the first provably efficient result for learning from interactions in overcomplete POMDPs.
arXiv Detail & Related papers (2022-04-19T16:08:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.