On the Sample Complexity of Reinforcement Learning with Policy Space
Generalization
- URL: http://arxiv.org/abs/2008.07353v1
- Date: Mon, 17 Aug 2020 14:26:18 GMT
- Title: On the Sample Complexity of Reinforcement Learning with Policy Space
Generalization
- Authors: Wenlong Mou, Zheng Wen, Xi Chen
- Abstract summary: We study the optimal sample complexity in large-scale Reinforcement Learning (RL) problems with policy space generalization.
Existing results show that without a generalization model, the sample complexity of an RL algorithm will inevitably depend on the cardinalities of the state and action spaces.
This paper proposes a new notion of eluder dimension for the policy space, which characterizes the intrinsic complexity of policy learning.
- Score: 21.879621917722613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the optimal sample complexity in large-scale Reinforcement Learning
(RL) problems with policy space generalization, i.e., the agent has prior
knowledge that the optimal policy lies in a known policy space. Existing
results show that without a generalization model, the sample complexity of an
RL algorithm will inevitably depend on the cardinalities of the state and
action spaces, which are intractably large in many practical problems.
To avoid such undesirable dependence on the state and action space sizes,
this paper proposes a new notion of eluder dimension for the policy space,
which characterizes the intrinsic complexity of policy learning in an arbitrary
Markov Decision Process (MDP). Using a simulator oracle, we prove a
near-optimal sample complexity upper bound that only depends linearly on the
eluder dimension. We further prove a similar regret bound in deterministic
systems without the simulator.
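For orientation, the notion builds on the eluder dimension of Russo and Van Roy for function classes. A sketch of that standard definition follows; the paper's policy-space variant is analogous but may differ in its exact formulation:
```latex
% Standard function-class eluder dimension (Russo & Van Roy);
% the paper adapts this notion to policy spaces.
$x$ is \emph{$\varepsilon$-dependent} on $x_1,\dots,x_n$ with respect to a
class $\mathcal{F}$ if every pair $f,\tilde f\in\mathcal{F}$ with
\[
  \sqrt{\textstyle\sum_{i=1}^{n}\bigl(f(x_i)-\tilde f(x_i)\bigr)^{2}}
  \;\le\; \varepsilon
\]
also satisfies $\lvert f(x)-\tilde f(x)\rvert\le\varepsilon$, and
$\varepsilon$-independent otherwise. The eluder dimension
$\dim_{E}(\mathcal{F},\varepsilon)$ is the length of the longest sequence
in which every element is $\varepsilon'$-independent of its predecessors
for some $\varepsilon'\ge\varepsilon$.
```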
Related papers
- Sample Complexity of Offline Distributionally Robust Linear Markov Decision Processes [37.15580574143281]
This paper considers the sample complexity of distributionally robust linear Markov decision processes (MDPs), where the uncertainty set is characterized by the total variation distance, using offline data.
We develop a pessimistic model-based algorithm and establish its sample complexity bound under minimal data coverage assumptions.
arXiv Detail & Related papers (2024-03-19T17:48:42Z)
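As background for the entry above (not the paper's algorithm itself): under a total-variation ball of radius $\rho$ around a nominal transition row, the worst-case expected next-state value has a simple greedy form, shifting up to $\rho$ probability mass from the highest-value states onto the lowest-value state. A minimal numpy sketch, with function and variable names of our own choosing:
```python
import numpy as np

def tv_robust_backup(p, v, rho):
    """Worst-case E_q[v] over distributions q with TV(p, q) <= rho.

    Greedy solution: shift up to `rho` probability mass away from the
    highest-value states onto the single lowest-value state.
    """
    q = p.astype(float).copy()
    budget = rho                      # TV radius = half the L1 budget
    worst = np.argmin(v)              # mass receiver: lowest-value state
    for i in np.argsort(v)[::-1]:     # donors: highest-value states first
        if i == worst or budget <= 0:
            continue
        moved = min(q[i], budget)
        q[i] -= moved
        q[worst] += moved
        budget -= moved
    return q @ v

# Example: nominal row, value vector, radius 0.1
p = np.array([0.5, 0.3, 0.2])
v = np.array([1.0, 0.0, 2.0])
print(tv_robust_backup(p, v, 0.1))   # 0.7, strictly below p @ v = 0.9
```
A pessimistic model-based algorithm would apply a backup of this shape inside value iteration; the paper's actual construction for linear MDPs is more involved.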
- Distributionally Robust Model-based Reinforcement Learning with Large State Spaces [55.14361269378122]
Three major challenges in reinforcement learning are complex dynamical systems with large state spaces, costly data acquisition processes, and the deviation of real-world dynamics from the training environment at deployment.
We study distributionally robust Markov decision processes with continuous state spaces under the widely used Kullback-Leibler, chi-square, and total variation uncertainty sets.
We propose a model-based approach that utilizes Gaussian Processes and the maximum variance reduction algorithm to efficiently learn multi-output nominal transition dynamics.
arXiv Detail & Related papers (2023-09-05T13:42:11Z)
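To illustrate the maximum variance reduction idea from the entry above (a generic active-learning sketch, not the paper's multi-output method): fit a Gaussian Process to the data so far and query the candidate input with the largest posterior standard deviation. The 1-D toy function `f` and all names here are our own assumptions; the GP API is scikit-learn's:
```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

f = lambda x: np.sin(3 * x)            # stand-in for unknown dynamics

# Candidate inputs and a small initial design
X_pool = np.linspace(0.0, 2.0, 200).reshape(-1, 1)
X, y = X_pool[[0, -1]], f(X_pool[[0, -1]]).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-4)
for _ in range(10):
    gp.fit(X, y)
    _, std = gp.predict(X_pool, return_std=True)
    x_next = X_pool[np.argmax(std)]    # query the most uncertain input
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next).item())
```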
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
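A toy rendering of the pessimism principle behind PPL (our own schematic, not the algorithm itself): given per-policy value estimates and confidence widths, select the policy with the best lower confidence bound rather than the best point estimate:
```python
import numpy as np

def pick_policy_pessimistic(estimates, widths):
    """Select the argmax of the lower confidence bound (LCB) on policy value."""
    lcb = np.asarray(estimates) - np.asarray(widths)
    return int(np.argmax(lcb))

# A poorly-covered policy (huge width) loses despite the best point estimate.
estimates = [0.60, 0.90, 0.55]
widths    = [0.05, 0.70, 0.02]
print(pick_policy_pessimistic(estimates, widths))  # -> 0
```
This is why no uniform overlap (lower-bounded propensities) is needed: a policy seen only under poor coverage carries a wide interval and is automatically discounted.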
- Near-optimal Policy Identification in Active Reinforcement Learning [84.27592560211909]
AE-LSVI is a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration.
We show that AE-LSVI outperforms other algorithms in a variety of environments when robustness to the initial state is required.
arXiv Detail & Related papers (2022-12-19T14:46:57Z)
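A one-state cartoon of the optimism-plus-pessimism split in the entry above (AE-LSVI itself works with kernelized value functions over a full MDP; everything here is a hypothetical simplification): the gap between upper and lower confidence estimates drives which action to explore, while pessimism picks the action to commit to:
```python
import numpy as np

def explore_and_commit_step(q_ucb, q_lcb):
    """Toy split of roles: uncertainty drives exploration,
    pessimism certifies the returned action."""
    explore = int(np.argmax(q_ucb - q_lcb))  # largest confidence gap
    commit  = int(np.argmax(q_lcb))          # pessimistically best action
    return explore, commit

q_ucb = np.array([0.9, 1.4, 0.8])
q_lcb = np.array([0.7, 0.2, 0.6])
print(explore_and_commit_step(q_ucb, q_lcb))  # -> (1, 0)
```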
- PAC Reinforcement Learning for Predictive State Representations [60.00237613646686]
We study online Reinforcement Learning (RL) in partially observable dynamical systems.
We focus on the Predictive State Representations (PSRs) model, which is an expressive model that captures other well-known models.
We develop a novel model-based algorithm for PSRs that can learn a near-optimal policy with sample complexity that scales polynomially in the relevant problem parameters.
arXiv Detail & Related papers (2022-07-12T17:57:17Z)
- Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design [12.056495277232118]
This work seeks to understand the "instance-dependent" complexity of learning near-optimal policies.
We propose an algorithm, PEDEL, whose sample complexity scales with a fine-grained instance-dependent measure of complexity.
We show that PEDEL yields provable gains over low-regret, minimax-optimal algorithms.
arXiv Detail & Related papers (2022-07-06T10:42:57Z)
- Provable Reinforcement Learning with a Short-Term Memory [68.00677878812908]
We study a new subclass of POMDPs, whose latent states can be decoded from the most recent history of a short length $m$.
In particular, in the rich-observation setting, we develop new algorithms using a novel "moment matching" approach with a sample complexity that scales exponentially in the short length $m$ rather than in the problem horizon.
Our results show that a short-term memory suffices for reinforcement learning in these environments.
arXiv Detail & Related papers (2022-02-08T16:39:57Z)
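The short-memory decoding assumption in the entry above has a familiar practical analogue in frame stacking, where the effective state is the window of the last $m$ observations. A generic sketch (the wrapper class and the assumed `reset()`/`step()` environment interface are our own, not the paper's):
```python
from collections import deque

class ShortMemoryWrapper:
    """Expose the last m observations as the agent's effective state."""

    def __init__(self, env, m):
        self.env, self.m = env, m
        self.window = deque(maxlen=m)

    def reset(self):
        obs = self.env.reset()
        self.window.clear()
        self.window.append(obs)
        return tuple(self.window)          # tuple: hashable megastate

    def step(self, action):
        # Assumed interface: env.step returns (obs, reward, done).
        obs, reward, done = self.env.step(action)
        self.window.append(obs)            # drops observations older than m
        return tuple(self.window), reward, done
```
Under the paper's assumption, the wrapped process is a fully observable MDP over these length-$m$ megastates.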
- Reinforcement Learning for Adaptive Mesh Refinement [63.7867809197671]
We propose a novel formulation of AMR as a Markov decision process and apply deep reinforcement learning to train refinement policies directly from simulation.
The model sizes of these policy architectures are independent of the mesh size and hence scale to arbitrarily large and complex simulations.
arXiv Detail & Related papers (2021-03-01T22:55:48Z)
- Learning with Safety Constraints: Sample Complexity of Reinforcement Learning for Constrained MDPs [13.922754427601491]
We characterize the relationship between safety constraints and the number of samples needed to ensure a desired level of accuracy.
Our main finding is that, compared to the best known bounds for the unconstrained regime, the sample complexity of constrained RL algorithms is increased by a factor that is logarithmic in the number of constraints.
arXiv Detail & Related papers (2020-08-01T18:17:08Z)