Related papers: Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations

URL: http://arxiv.org/abs/2106.11519v1
Date: Tue, 22 Jun 2021 03:20:40 GMT
Title: Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations
Authors: Christoph Dann, Yishay Mansour, Mehryar Mohri, Ayush Sekhari and Karthik Sridharan
Abstract summary: We consider the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies $\Pi$ that may not contain any near-optimal policy. We provide an algorithm for this setting whose error is bounded in terms of the rank $d$ of the underlying MDP.
Score: 79.66404989555566
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: There have been many recent advances on provably efficient Reinforcement Learning (RL) in problems with rich observation spaces. However, all these works share a strong realizability assumption about the optimal value function of the true MDP. Such realizability assumptions are often too strong to hold in practice. In this work, we consider the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies $\Pi$ that may not contain any near-optimal policy. We provide an algorithm for this setting whose error is bounded in terms of the rank $d$ of the underlying MDP. Specifically, our algorithm enjoys a sample complexity bound of $\widetilde{O}\left((H^{4d} K^{3d} \log |\Pi|)/\epsilon^2\right)$ where $H$ is the length of episodes, $K$ is the number of actions and $\epsilon>0$ is the desired sub-optimality. We also provide a nearly matching lower bound for this agnostic setting that shows that the exponential dependence on rank is unavoidable, without further assumptions.
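To make the exponential dependence on the rank concrete, the following minimal sketch evaluates the leading term $H^{4d} K^{3d} \log |\Pi| / \epsilon^2$ of the stated bound for a few values of $d$. The function name and the parameter values are hypothetical, chosen only for illustration, and the constants and logarithmic factors hidden by $\widetilde{O}$ are ignored, so the numbers indicate growth rates rather than actual sample counts.

```python
import math


def agnostic_sample_bound(H, K, d, num_policies, eps):
    """Leading term H^(4d) * K^(3d) * log|Pi| / eps^2 of the bound.

    Illustrative only: constants and log factors hidden by the
    tilde-O notation are dropped.
    """
    return (H ** (4 * d)) * (K ** (3 * d)) * math.log(num_policies) / eps ** 2


# Hypothetical problem sizes: horizon 10, 2 actions, 1000 policies, eps = 0.1.
for d in (1, 2, 3):
    bound = agnostic_sample_bound(H=10, K=2, d=d, num_policies=1000, eps=0.1)
    print(f"rank d={d}: leading term ~ {bound:.3e}")
```

Each unit increase in $d$ multiplies the leading term by $H^4 K^3$, which is the exponential dependence that the lower bound shows cannot be avoided without further assumptions.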

Related papers

Computational Hardness of Reinforcement Learning with Partial $q^\pi$-Realizability [1.6328866317851185]
This paper investigates the computational complexity of reinforcement learning in a novel linear function approximation regime, termed partial $q^\pi$-realizability. We prove that learning an $\epsilon$-optimal policy in this setting is computationally hard. Our results mirror those in $q^*$-realizability and suggest computational difficulty persists even when $\Pi$ is expanded beyond the optimal policy.
arXiv Detail & Related papers (2025-10-24T01:18:49Z)
Towards Fundamental Limits for Active Multi-distribution Learning [16.639855803241524]
We develop new algorithms for active multi-distribution learning and establish improved label complexity upper and lower bounds. We show that the bound in the realizable setting is information-theoretically optimal and that the $k\nu/\varepsilon^2$ term in the setting is fundamental for proper learners.
arXiv Detail & Related papers (2025-06-21T06:08:58Z)
Actor-Critics Can Achieve Optimal Sample Efficiency [15.033410073144939]
We introduce a novel actor-critic algorithm that attains a sample complexity of $O(dH^5 \log|\mathcal{A}|/\epsilon^2 + dH^4 \log|\mathcal{F}|/\epsilon^2)$ trajectories. We extend this to the setting of Hybrid RL, showing that initializing the critic with offline data yields sample efficiency gains compared to purely offline or online RL.
arXiv Detail & Related papers (2025-05-06T17:32:39Z)
Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel. An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost. Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations. Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z)
Adversarial Online Multi-Task Reinforcement Learning [12.421997449847153]
We consider the adversarial online multi-task reinforcement learning setting. In each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models. The learner's objective is to generalize its regret with respect to the optimal policy for each task.
arXiv Detail & Related papers (2023-01-11T02:18:26Z)
Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model. The lower bound, given as the solution of an intricate non- optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure [9.759209713196718]
We consider a class of MDPs for which the associated optimal $Q^*$ function is low rank, where the latent features are unknown. We show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of $\tilde{O}\left((|S|+|A|)\mathrm{poly}(d,H)/\epsilon^2\right)$ for a rank
arXiv Detail & Related papers (2022-06-07T20:39:51Z)
Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally [58.463668865380946]
We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$. We establish an upper bound $O(|\mathcal{S}|H^{3/2}/N)$ for the suboptimality using the Mimic-MD algorithm in Rajaraman et al. (2020). We show the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}| \geq 3$ while the unknown-transition setting suffers from a larger sharp rate
arXiv Detail & Related papers (2021-02-25T15:50:19Z)
Towards Tractable Optimism in Model-Based Reinforcement Learning [37.51073590932658]
To be successful, an optimistic RL algorithm must over-estimate the true value function (optimism) but not by so much that it is inaccurate (estimation error). We re-interpret these scalable optimistic model-based algorithms as solving a tractable noise augmented MDP. We show that if this error is reduced, optimistic model-based RL algorithms can match state-of-the-art performance in continuous control problems.
arXiv Detail & Related papers (2020-06-21T20:53:19Z)
Learning Near Optimal Policies with Low Inherent Bellman Error [115.16037976819331]
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning. We show that exploration is possible using only batch assumptions with an algorithm that achieves the optimal statistical rate for the setting we consider.
arXiv Detail & Related papers (2020-02-29T02:02:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.