Improved Regret Bound and Experience Replay in Regularized Policy
Iteration
- URL: http://arxiv.org/abs/2102.12611v1
- Date: Thu, 25 Feb 2021 00:55:07 GMT
- Title: Improved Regret Bound and Experience Replay in Regularized Policy
Iteration
- Authors: Nevena Lazic, Dong Yin, Yasin Abbasi-Yadkori, Csaba Szepesvari
- Abstract summary: We study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation.
We first show that the regret analysis of the Politex algorithm can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions.
Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting.
- Score: 22.621710838468097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we study algorithms for learning in infinite-horizon
undiscounted Markov decision processes (MDPs) with function approximation. We
first show that the regret analysis of the Politex algorithm (a version of
regularized policy iteration) can be sharpened from $O(T^{3/4})$ to
$O(\sqrt{T})$ under nearly identical assumptions, and instantiate the bound
with linear function approximation. Our result provides the first
high-probability $O(\sqrt{T})$ regret bound for a computationally efficient
algorithm in this setting. The exact implementation of Politex with neural
network function approximation is inefficient in terms of memory and
computation. Since our analysis suggests that we need to approximate the
average of the action-value functions of past policies well, we propose a
simple efficient implementation where we train a single Q-function on a replay
buffer with past data. We show that this often leads to superior performance
over other implementation choices, especially in terms of wall-clock time. Our
work also provides a novel theoretical justification for using experience
replay within policy iteration algorithms.
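The key computational point is that the softmax policy only needs the running average of past action-value estimates, not each estimate individually. Below is a minimal tabular sketch of that equivalence, assuming toy sizes, an illustrative step size `eta`, and randomly drawn stand-ins for the per-phase Q-estimates; it is not the paper's actual training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_phases, eta = 6, 3, 5, 0.5  # toy sizes; eta illustrative

def softmax(x):
    z = x - x.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Stand-ins for per-phase policy evaluation; a real agent would estimate
# each Q_i from rollouts of policy pi_i (e.g., with LSTD or a neural network).
q_estimates = [rng.normal(size=(n_states, n_actions)) for _ in range(n_phases)]

# Exact Politex: pi_k(a|s) is proportional to exp(eta * sum_{i<k} Q_i(s,a)),
# so memory grows with k because every past Q-function must be kept around.
pi_exact = softmax(eta * np.sum(q_estimates, axis=0))

# Replay-buffer variant sketched in the abstract: one Q-function trained on
# pooled past data approximates the AVERAGE of past Q's; rescaling by the
# phase count k recovers the same policy without storing k networks.
q_avg = np.mean(q_estimates, axis=0)  # proxy for the replay-trained Q-function
pi_replay = softmax(eta * n_phases * q_avg)

assert np.allclose(pi_exact, pi_replay)
```

Since the sum of $k$ Q-functions is exactly $k$ times their average, approximating the average well (which a single replay-trained network can do) is all the policy update requires; this is the memory saving the abstract describes.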
Related papers
- Confident Natural Policy Gradient for Local Planning in $q_\pi$-realizable Constrained MDPs [44.69257217086967]
The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives.
In this paper, we address the learning problem given linear function approximation with $q_\pi$-realizability.
arXiv Detail & Related papers (2024-06-26T17:57:13Z)
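For reference, the CMDP objective and the $q_\pi$-realizability assumption from the paper above can be written in their standard forms; this is background notation, not the paper's specific formulation ($b$ is the constraint budget and $\phi$ a known feature map):

```latex
\max_{\pi} \; V_r^{\pi}(s_0)
\quad \text{s.t.} \quad V_c^{\pi}(s_0) \le b,
\qquad \text{where } q^{\pi}(s,a) = \phi(s,a)^{\top} w_{\pi}
\;\; \text{for some } w_{\pi} \in \mathbb{R}^{d}, \;\; \forall \pi .
```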
- Achieving Tractable Minimax Optimal Regret in Average Reward MDPs [19.663336027878408]
We present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^*) S A T})$.
Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$.
arXiv Detail & Related papers (2024-06-03T11:53:44Z)
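The quantity $\mathrm{sp}(h^*)$ in the bound above is the span seminorm of the optimal bias function, which arises from the average-reward optimality equation; both are standard definitions, stated here only for context:

```latex
\mathrm{sp}(h^*) = \max_{s} h^*(s) - \min_{s} h^*(s),
\qquad
\rho^* + h^*(s) = \max_{a} \Big[ r(s,a) + \sum_{s'} P(s' \mid s, a)\, h^*(s') \Big].
```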
- A New Policy Iteration Algorithm For Reinforcement Learning in Zero-Sum Markov Games [10.805520579293747]
We show that a simple variant of naive policy iteration for games converges exponentially fast.
We also show that lookahead policies can be implemented efficiently in the function approximation setting of linear Markov games.
arXiv Detail & Related papers (2023-03-17T01:20:22Z)
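Policy iteration for zero-sum Markov games is built around the Shapley (minimax Bellman) operator, in which each state hosts a matrix game; the discounted form below is standard background rather than this paper's specific variant:

```latex
(T v)(s) = \max_{x \in \Delta(A)} \min_{y \in \Delta(B)}
\sum_{a, b} x(a)\, y(b)
\Big[ r(s, a, b) + \gamma \sum_{s'} P(s' \mid s, a, b)\, v(s') \Big].
```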
- Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximation.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z)
- Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
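A widely used way to model such pairwise trajectory preferences is the Bradley-Terry form; the snippet below is an illustrative stand-in under that standard model, not necessarily the exact preference model of this paper:

```python
import numpy as np

def preference_prob(return_1: float, return_2: float) -> float:
    # Bradley-Terry model: trajectory 1 is preferred with probability
    # sigmoid(R(tau_1) - R(tau_2)) under a learned trajectory-return model.
    return 1.0 / (1.0 + np.exp(-(return_1 - return_2)))

# A trajectory whose modeled return is higher by 1.0 is preferred ~73% of
# the time; the overseer's binary labels supervise the return model.
print(preference_prob(2.0, 1.0))  # ~0.731
```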
- Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation [92.3161051419884]
We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, together with off-policy learning and function approximation (FA), results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
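The average-reward setting replaces discounting with reward centering: the differential TD error subtracts a running estimate $\bar{\rho}_t$ of the average reward before bootstrapping. This is the standard form of the update such methods build on:

```latex
\delta_t = r_{t+1} - \bar{\rho}_t + \hat{v}(s_{t+1}) - \hat{v}(s_t).
```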
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
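To show the primal-dual mechanism in the smallest possible setting, here is a sketch on a two-action constrained bandit; the payoffs, budget, and step sizes are invented for illustration, and an exact softmax policy gradient stands in for the paper's function-approximation machinery:

```python
import numpy as np

# Toy CMDP (a one-state bandit): maximize expected reward subject to
# expected cost <= budget. All numbers here are illustrative.
r = np.array([1.0, 0.0])   # action 0 is rewarding...
c = np.array([1.0, 0.0])   # ...but also costly
budget, lr_theta, lr_lam = 0.5, 0.2, 0.2

theta, lam = np.zeros(2), 0.0   # policy logits and Lagrange multiplier
pi_avg = np.zeros(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for t in range(2000):
    pi = softmax(theta)
    # Lagrangian L(pi, lam) = E[r] - lam * (E[c] - budget): primal ascent...
    adv = r - lam * c
    theta += lr_theta * pi * (adv - pi @ adv)   # exact softmax policy gradient
    # ...and projected dual ascent on the constraint violation (lam >= 0).
    lam = max(0.0, lam + lr_lam * (pi @ c - budget))
    pi_avg += (pi - pi_avg) / (t + 1)           # guarantees usually hold for
                                                # the averaged primal iterate

print(pi_avg, lam)  # averaged policy ~[0.5, 0.5]: costly arm used up to budget
```

The multiplier rises while the constraint is violated and prices the costly action out of the policy; CMDP algorithms like the one above implement the same feedback loop at scale.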
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.