Provably Efficient Reinforcement Learning in Partially Observable
Dynamical Systems
- URL: http://arxiv.org/abs/2206.12020v1
- Date: Fri, 24 Jun 2022 00:27:42 GMT
- Title: Provably Efficient Reinforcement Learning in Partially Observable
Dynamical Systems
- Authors: Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun
- Abstract summary: We study Reinforcement Learning for partially observable dynamical systems using function approximation.
We propose a new Partially Observable Bilinear Actor-Critic framework that is general enough to include models such as POMDPs, observable Linear-Quadratic-Gaussian (LQG) systems, Predictive State Representations (PSRs), a newly introduced model, Hilbert Space Embeddings of POMDPs, and observable POMDPs with latent low-rank transition.
- Score: 97.12538243736705
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study Reinforcement Learning for partially observable dynamical systems
using function approximation. We propose a new \textit{Partially Observable
Bilinear Actor-Critic framework} that is general enough to include models such
as observable tabular Partially Observable Markov Decision Processes (POMDPs),
observable Linear-Quadratic-Gaussian (LQG) systems, Predictive State
Representations (PSRs), a newly introduced model, Hilbert Space Embeddings of
POMDPs, and observable POMDPs with latent low-rank transition. Under this framework, we
propose an actor-critic style algorithm that is capable of performing agnostic
policy learning. Given a policy class that consists of memory-based policies
(that look at a fixed-length window of recent observations), and a value
function class that consists of functions taking both memory and future
observations as inputs, our algorithm learns to compete against the best
memory-based policy in the given policy class. For certain examples such as
undercomplete observable tabular POMDPs, observable LQGs and observable POMDPs
with latent low-rank transition, by implicitly leveraging their special
properties, our algorithm is even capable of competing against the globally
optimal policy without paying an exponential dependence on the horizon in its
sample complexity.
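To make the setup concrete, here is a minimal, hypothetical Python sketch of the two objects the abstract describes: a memory-based policy that acts on a fixed-length window of recent observations, and a value function that takes both that memory and a few future observations as input. The toy environment, the tabular policy, the linear critic, and the names M, K, MemoryPolicy, critic, and env_step are illustrative assumptions for this example, not the paper's PO-Bilinear Actor-Critic algorithm.

```python
# Illustrative sketch only (assumed toy setup, not the paper's algorithm):
# a memory-based policy over the last M observations, and a critic that
# scores a (memory, future observations) pair.
import random
from collections import deque

M = 3          # length of the memory window of recent observations
K = 2          # number of future observations the critic looks at
ACTIONS = (0, 1)

class MemoryPolicy:
    """Tabular policy that depends only on the last M observations."""
    def __init__(self):
        self.table = {}

    def act(self, memory):
        key = tuple(memory)
        if key not in self.table:                 # unseen memory: pick an action at random
            self.table[key] = random.choice(ACTIONS)
        return self.table[key]

def critic(memory, future, weights):
    """Member of the value-function class: linear in (memory, future) features."""
    feats = list(memory) + list(future)
    return sum(w * x for w, x in zip(weights, feats))

def env_step(hidden, action):
    """Toy POMDP: a hidden bit that the action can flip, observed through noise."""
    hidden ^= action
    obs = hidden if random.random() < 0.9 else 1 - hidden
    reward = float(action == hidden)
    return hidden, obs, reward

# Roll out one episode with the memory-based policy.
policy, hidden = MemoryPolicy(), 0
memory = deque([0] * M, maxlen=M)
observations, ret = [], 0.0
for _ in range(30):
    action = policy.act(memory)
    hidden, obs, reward = env_step(hidden, action)
    observations.append(obs)
    memory.append(obs)
    ret += reward

# Score the critic at time t: its memory input is obs[t-M:t], its "future" input is obs[t:t+K].
t = 10
weights = [0.1] * (M + K)
print("episode return:", ret)
print("critic score:", critic(observations[t - M:t], observations[t:t + K], weights))
```

The point of the sketch is only the interface: the policy never sees the latent state, and the critic's extra input is a short slice of future observations, mirroring the policy and value-function classes described in the abstract.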
Related papers
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation, with performance similar to or stronger than that of PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
- Provably Efficient Representation Learning with Tractable Planning in Low-Rank POMDP [81.00800920928621]
We study representation learning in partially observable Markov Decision Processes (POMDPs).
We first present an algorithm for decodable POMDPs that combines maximum likelihood estimation (MLE) and optimism in the face of uncertainty (OFU).
We then show how to adapt this algorithm to also work in the broader class of $\gamma$-observable POMDPs.
arXiv Detail & Related papers (2023-06-21T16:04:03Z)
- GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond [101.5329678997916]
We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making.
We propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation.
We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR.
arXiv Detail & Related papers (2022-11-03T16:42:40Z)
- Future-Dependent Value-Based Off-Policy Evaluation in POMDPs [67.21319339512699]
We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation.
We develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs.
We extend our methods to learning of dynamics and establish the connection between our approach and the well-known spectral learning methods in POMDPs.
arXiv Detail & Related papers (2022-07-26T17:53:29Z)
- Towards Using Fully Observable Policies for POMDPs [0.0]
The Partially Observable Markov Decision Process (POMDP) is a framework applicable to many real-world problems.
We propose an approach to solve POMDPs with multimodal belief by relying on a policy that solves the fully observable version.
arXiv Detail & Related papers (2022-07-24T13:22:13Z)
- PAC Reinforcement Learning for Predictive State Representations [60.00237613646686]
We study online Reinforcement Learning (RL) in partially observable dynamical systems.
We focus on the Predictive State Representations (PSRs) model, which is an expressive model that captures other well-known models.
We develop a novel model-based algorithm for PSRs that can learn a near-optimal policy with sample complexity scaling polynomially in the relevant problem parameters.
arXiv Detail & Related papers (2022-07-12T17:57:17Z)
- Pessimistic Model-based Offline RL: PAC Bounds and Posterior Sampling under Partial Coverage [33.766012922307084]
We study model-based offline Reinforcement Learning with general function approximation.
We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO), which leverages a general function class and uses a constraint to encode pessimism.
arXiv Detail & Related papers (2021-07-13T16:30:01Z)