Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency
- URL: http://arxiv.org/abs/2205.13476v2
- Date: Mon, 1 Apr 2024 01:53:31 GMT
- Title: Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency
- Authors: Lingxiao Wang, Qi Cai, Zhuoran Yang, Zhaoran Wang,
- Abstract summary: Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges.
It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon.
We propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy.
- Score: 105.17746223041954
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy.~(i) For each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step feature. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To our best knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.
Related papers
- Near-optimal Policy Identification in Active Reinforcement Learning [84.27592560211909]
AE-LSVI is a novel variant of the kernelized least-squares value RL (LSVI) algorithm that combines optimism with pessimism for active exploration.
We show that AE-LSVI outperforms other algorithms in a variety of environments when robustness to the initial state is required.
arXiv Detail & Related papers (2022-12-19T14:46:57Z) - DIAMOND: Taming Sample and Communication Complexities in Decentralized
Bilevel Optimization [27.317118892531827]
We develop a new decentralized bilevel optimization called DIAMOND (decentralized single-timescale approximation with momentum and gradient-tracking)
We show that the DIAMOND enjoys $mathcalO(epsilon-3/2)$ in sample and communication complexities for achieving an $epsilon$-stationary solution.
arXiv Detail & Related papers (2022-12-05T15:58:00Z) - GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP,
and Beyond [101.5329678997916]
We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making.
We propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation.
We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR.
arXiv Detail & Related papers (2022-11-03T16:42:40Z) - PAC Reinforcement Learning for Predictive State Representations [60.00237613646686]
We study online Reinforcement Learning (RL) in partially observable dynamical systems.
We focus on the Predictive State Representations (PSRs) model, which is an expressive model that captures other well-known models.
We develop a novel model-based algorithm for PSRs that can learn a near optimal policy in sample complexity scalingly.
arXiv Detail & Related papers (2022-07-12T17:57:17Z) - Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency [111.83670279016599]
We study reinforcement learning for partially observed decision processes (POMDPs) with infinite observation and state spaces.
We make the first attempt at partial observability and function approximation for a class of POMDPs with a linear structure.
arXiv Detail & Related papers (2022-04-20T21:15:38Z) - Sample-Efficient Reinforcement Learning of Undercomplete POMDPs [91.40308354344505]
This work shows that these hardness barriers do not preclude efficient reinforcement learning for rich and interesting subclasses of Partially Observable Decision Processes (POMDPs)
We present a sample-efficient algorithm, OOM-UCB, for episodic finite undercomplete POMDPs, where the number of observations is larger than the number of latent states and where exploration is essential for learning, thus distinguishing our results from prior works.
arXiv Detail & Related papers (2020-06-22T17:58:54Z) - A Second look at Exponential and Cosine Step Sizes: Simplicity,
Adaptivity, and Performance [23.89815527019194]
Gradient Descent (SGD) is a popular tool in large-scale machine learning models.
It is highly variable, depending crucially on the choice of the step sizes.
A variety of strategies for tuning the step sizes have been proposed.
arXiv Detail & Related papers (2020-02-12T23:10:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.