Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning
- URL: http://arxiv.org/abs/2210.08238v1
- Date: Sat, 15 Oct 2022 09:22:22 GMT
- Title: Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning
- Authors: Zihan Zhang, Yuhang Jiang, Yuan Zhou and Xiangyang Ji
- Abstract summary: We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm to achieve near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximated transition model.
- Score: 54.806166861456035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper,
we study the episodic reinforcement learning (RL) problem modeled by
finite-horizon Markov Decision Processes (MDPs) with a constraint on the number
of batches. In the multi-batch reinforcement learning framework, the agent is
required to provide a time schedule for policy updates in advance; this setting
is particularly suitable for scenarios in which the agent suffers heavily from
changing the policy adaptively. Given a finite-horizon MDP
with $S$ states, $A$ actions and planning horizon $H$, we design a
computationally efficient algorithm to achieve near-optimal regret of
$\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$\footnote{$\tilde{O}(\cdot)$ hides
logarithmic terms of $(S,A,H,K)$} in $K$ episodes using
$O\left(H+\log_2\log_2(K) \right)$ batches with confidence parameter $\delta$.
To the best of our knowledge, this is the first $\tilde{O}(\sqrt{SAH^3K})$ regret
bound with $O(H+\log_2\log_2(K))$ batch complexity. Meanwhile, we show that to
achieve $\tilde{O}(\mathrm{poly}(S,A,H)\sqrt{K})$ regret, the number of batches
is at least $\Omega\left(H/\log_A(K)+ \log_2\log_2(K) \right)$, which matches
our upper bound up to logarithmic terms.
Our technical contributions are two-fold: 1) a near-optimal design scheme to
explore over the unlearned states; 2) a computationally efficient algorithm to
explore certain directions with an approximated transition model.
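To make the batch complexity concrete, here is a minimal, hypothetical sketch (not the paper's algorithm) of a pre-committed batch schedule: after $H$ short warm-up batches that stand in for the stage-wise exploration of unlearned states, the remaining batch endpoints follow the classical doubling-in-the-exponent grid $t_i = K^{1-2^{-i}}$, which keeps the per-batch regret contribution at the $\sqrt{K}$ scale while using only about $\log_2\log_2 K$ additional batches. The function name and the warm-up batch lengths below are illustrative assumptions.

```python
import math

def batch_schedule(K: int, H: int) -> list[int]:
    """Illustrative batch schedule with roughly H + log2(log2(K)) + 1 batches.

    This is only a sketch of the flavour of the result, not the paper's exact
    construction: H short warm-up batches (placeholders for the stage-wise
    exploration of unlearned states), followed by the classical
    doubling-in-the-exponent grid t_i = K^(1 - 2^-i), and a final batch that
    runs to episode K.
    """
    ends: list[int] = []
    warmup = min(H, K)
    ends.extend(range(1, warmup + 1))        # warm-up: H batches of 1 episode each (placeholder)
    # Number of grid points needed: about log2(log2(K)).
    M = max(1, math.ceil(math.log2(max(2.0, math.log2(max(K, 2))))))
    for i in range(1, M + 1):                # exponent-doubling grid t_i = K^(1 - 2^-i)
        t = math.ceil(K ** (1.0 - 2.0 ** (-i)))
        if ends[-1] < t < K:
            ends.append(t)
    if ends[-1] < K:
        ends.append(K)                       # final batch runs to episode K
    return ends

if __name__ == "__main__":
    K, H = 1_000_000, 10
    sched = batch_schedule(K, H)
    print(f"{len(sched)} batches for K={K}, H={H}: {sched}")
```

For example, with $K=10^6$ and $H=10$ the sketch commits to 16 policy-update points in advance, on the order of the $O(H+\log_2\log_2 K)$ batch complexity claimed in the abstract.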
Related papers
- Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation [3.2703356989962518]
We study model-based reinforcement learning with non-linear function approximation.
We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings.
arXiv Detail & Related papers (2024-06-19T15:29:14Z) - Near-Optimal Reinforcement Learning with Self-Play under Adaptivity
Constraints [21.412085244757336]
We study the problem of multi-agent reinforcement learning (MARL) with adaptivity constraints.
For two-player zero-sum Markov Games, we design a (policy) elimination based algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^3 S^2 ABK})$.
We prove a batch complexity lower bound of $\Omega(\frac{H}{\log_A K}+\log\log K)$ for all algorithms with $\widetilde{O}(\sqrt{K})$ regret.
arXiv Detail & Related papers (2024-02-02T03:00:40Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost [31.04961854943877]
We propose a new algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of $O(HSA \log\log T)$.
As a byproduct, our new algorithmic techniques allow us to derive a reward-free exploration algorithm with an optimal switching cost of $O(HSA)$.
arXiv Detail & Related papers (2022-02-13T19:01:06Z) - Provably Efficient Reinforcement Learning with Linear Function
Approximation Under Adaptivity Constraints [94.76881135901753]
We consider two popular limited adaptivity models: the batch learning model and the rare policy switch model.
Our proposed LSVI-UCB-Batch algorithm achieves an $\tilde{O}(\sqrt{d^3H^3T} + dHT/B)$ regret.
For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $\tilde{O}(\sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret.
arXiv Detail & Related papers (2021-01-06T18:56:07Z) - Logarithmic Regret for Reinforcement Learning with Linear Function
Approximation [99.59319332864129]
We show that logarithmic regret is attainable under two recently proposed linear MDP assumptions.
To the best of our knowledge, these are the first logarithmic regret bounds for RL with linear function approximation.
arXiv Detail & Related papers (2020-11-23T17:25:00Z) - $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic $Q$-learning enjoys a $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}(H)}{\Delta_{\min}}\log(SAT)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z) - Agnostic Q-learning with Function Approximation in Deterministic
Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
We study the problem of agnostic $Q$-learning with function approximation in deterministic systems.
We show that if $\delta = O\left(\rho/\sqrt{\dim_E}\right)$, then one can find the optimal policy using $O\left(\dim_E\right)$.
arXiv Detail & Related papers (2020-02-17T18:41:49Z)