Provably Efficient Reinforcement Learning via Surprise Bound
- URL: http://arxiv.org/abs/2302.11634v1
- Date: Wed, 22 Feb 2023 20:21:25 GMT
- Title: Provably Efficient Reinforcement Learning via Surprise Bound
- Authors: Hanlin Zhu, Ruosong Wang, Jason D. Lee
- Abstract summary: We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
- Score: 66.15308700413814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Value function approximation is important in modern reinforcement learning
(RL) problems especially when the state space is (infinitely) large. Despite
the importance and wide applicability of value function approximation, its
theoretical understanding is still not as sophisticated as its empirical
success, especially in the context of general function approximation. In this
paper, we propose a provably efficient RL algorithm (both computationally and
statistically) with general value function approximations. We show that if the
value functions can be approximated by a function class that satisfies the
Bellman-completeness assumption, our algorithm achieves an
$\widetilde{O}(\text{poly}(\iota H)\sqrt{T})$ regret bound where $\iota$ is the
product of the surprise bound and log-covering numbers, $H$ is the planning
horizon, $K$ is the number of episodes and $T = HK$ is the total number of
steps the agent interacts with the environment. Our algorithm achieves
reasonable regret bounds when applied to both the linear setting and the sparse
high-dimensional linear setting. Moreover, our algorithm only needs to solve
$O(H\log K)$ empirical risk minimization (ERM) problems, which is far more
efficient than previous algorithms that need to solve ERM problems for
$\Omega(HK)$ times.
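The claim that only $O(H\log K)$ ERM problems need to be solved can be illustrated with a lazy, doubling-style update schedule: the $H$ value functions are refit only when the amount of collected data has doubled, so each step of the horizon is refit roughly $\log K$ times. The sketch below is a hedged illustration under that assumption; the environment interface, the erm_oracle regression routine, and the greedy action rule are placeholders, not the paper's actual procedure.

```python
def greedy_action(q_fn, state, actions=(0, 1)):
    """Pick the action with the largest estimated value (placeholder rule)."""
    if q_fn is None:          # before the first ERM fit, act arbitrarily
        return actions[0]
    return max(actions, key=lambda a: q_fn(state, a))


def run_episodes(env, erm_oracle, H, K):
    """Hedged sketch: refit the H value functions only when the data set has
    doubled, so ERM is solved roughly H * log2(K) times over K episodes
    (the paper's actual scheduling rule may differ)."""
    data = [[] for _ in range(H)]                # per-step transition buffers
    q_fns = [None] * H + [lambda s, a: 0.0]      # q_fns[H] is the zero terminal value
    last_fit_size = 0                            # buffer size at the last refit
    erm_solves = 0

    for k in range(K):
        # Lazy (doubling) update: only refit when the data set has doubled.
        if len(data[0]) >= max(1, 2 * last_fit_size):
            for h in reversed(range(H)):         # backward over the planning horizon
                q_fns[h] = erm_oracle(data[h], q_fns[h + 1])
                erm_solves += 1
            last_fit_size = len(data[0])

        # Collect one episode with the current (greedy) value estimates.
        state = env.reset()
        for h in range(H):
            action = greedy_action(q_fns[h], state)
            next_state, reward = env.step(action)
            data[h].append((state, action, reward, next_state))
            state = next_state

    return q_fns, erm_solves                     # erm_solves grows as O(H log K)
```

With such a schedule the per-episode cost is dominated by data collection, and the total number of ERM calls grows only logarithmically in $K$, matching the $O(H\log K)$ count stated in the abstract.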
Related papers
- Achieving Tractable Minimax Optimal Regret in Average Reward MDPs [19.663336027878408]
We present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^*) S A T})$.
Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$.
arXiv Detail & Related papers (2024-06-03T11:53:44Z) - Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation [67.8414514524356]
We study a new class of MDPs that employs multinomial logit (MNL) function approximation to ensure valid probability distributions over the state space.
Introducing the non-linear function raises significant challenges in both computational and statistical efficiency.
We propose an algorithm that achieves the same regret with only $\mathcal{O}(1)$ cost.
arXiv Detail & Related papers (2024-05-27T11:31:54Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Human-in-the-loop: Provably Efficient Preference-based Reinforcement
Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer (a toy sketch of this feedback model appears after this list).
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z) - Improved Regret Bound and Experience Replay in Regularized Policy
Iteration [22.621710838468097]
We study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation.
We first show that the regret analysis of the Politex algorithm can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions.
Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting.
arXiv Detail & Related papers (2021-02-25T00:55:07Z) - Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function
Approximation [92.3161051419884]
We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z) - Reinforcement Learning with General Value Function Approximation:
Provably Efficient Approach via Bounded Eluder Dimension [124.7752517531109]
We establish a provably efficient reinforcement learning algorithm with general value function approximation.
We show that our algorithm achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$ where $d$ is a complexity measure.
Our theory generalizes recent progress on RL with linear value function approximation and does not make explicit assumptions on the model of the environment.
arXiv Detail & Related papers (2020-05-21T17:36:09Z)
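On the preference-based feedback model mentioned in the Human-in-the-loop entry above, the sketch below simulates an overseer that compares two trajectories. It is a toy illustration under a Bradley-Terry style assumption on the (hidden) trajectory returns; the function names, the logistic form, and the temperature parameter are illustrative assumptions, not taken from that paper.

```python
import math
import random


def trajectory_return(trajectory):
    """Sum the hidden per-step rewards of a trajectory of (state, action, reward) tuples."""
    return sum(reward for _, _, reward in trajectory)


def sample_preference(traj_a, traj_b, temperature=1.0):
    """Toy Bradley-Terry style overseer: returns 1 if traj_a is preferred, else 0.
    The learning agent only ever sees this binary comparison, never the returns."""
    gap = (trajectory_return(traj_a) - trajectory_return(traj_b)) / temperature
    prob_prefer_a = 1.0 / (1.0 + math.exp(-gap))
    return 1 if random.random() < prob_prefer_a else 0
```

A preference-based learner would collect such binary labels over trajectory pairs and fit a reward or value model to them, rather than regressing on per-step rewards.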
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.