Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning
- URL: http://arxiv.org/abs/2210.08238v1
- Date: Sat, 15 Oct 2022 09:22:22 GMT
- Title: Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning
- Authors: Zihan Zhang, Yuhang Jiang, Yuan Zhou and Xiangyang Ji
- Abstract summary: We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm to achieve near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximated transition model.
- Score: 54.806166861456035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper,
we study the episodic reinforcement learning (RL) problem modeled by
finite-horizon Markov Decision Processes (MDPs) with a constraint on the number
of batches. In the multi-batch reinforcement learning framework, the agent is
required to provide a time schedule for policy updates in advance; this setting
is particularly suitable for scenarios in which the agent suffers heavily from
changing the policy adaptively. Given a finite-horizon MDP
with $S$ states, $A$ actions and planning horizon $H$, we design a
computationally efficient algorithm to achieve near-optimal regret of
$\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$\footnote{$\tilde{O}(\cdot)$ hides
logarithmic terms of $(S,A,H,K)$} in $K$ episodes using
$O\left(H+\log_2\log_2(K) \right)$ batches with confidence parameter $\delta$.
To the best of our knowledge, this is the first $\tilde{O}(\sqrt{SAH^3K})$ regret
bound with $O(H+\log_2\log_2(K))$ batch complexity. Meanwhile, we show that to
achieve $\tilde{O}(\mathrm{poly}(S,A,H)\sqrt{K})$ regret, the number of batches
is at least $\Omega\left(H/\log_A(K)+ \log_2\log_2(K) \right)$, which matches
our upper bound up to logarithmic terms.
Our technical contributions are two-fold: 1) a near-optimal design scheme to
explore over the unlearned states; 2) a computationally efficient algorithm to
explore certain directions with an approximated transition model.
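To make the batch complexity concrete, here is a minimal, hypothetical sketch (not the paper's algorithm) of a pre-committed batch schedule: after $H$ short warm-up batches that stand in for the stage-wise exploration of unlearned states, the remaining batch endpoints follow the classical doubling-in-the-exponent grid $t_i = K^{1-2^{-i}}$, which keeps the per-batch regret contribution at the $\sqrt{K}$ scale while using only about $\log_2\log_2 K$ additional batches. The function name and the warm-up batch lengths below are illustrative assumptions.

```python
import math

def batch_schedule(K: int, H: int) -> list[int]:
    """Illustrative batch schedule with roughly H + log2(log2(K)) + 1 batches.

    This is only a sketch of the flavour of the result, not the paper's exact
    construction: H short warm-up batches (placeholders for the stage-wise
    exploration of unlearned states), followed by the classical
    doubling-in-the-exponent grid t_i = K^(1 - 2^-i), and a final batch that
    runs to episode K.
    """
    ends: list[int] = []
    warmup = min(H, K)
    ends.extend(range(1, warmup + 1))        # warm-up: H batches of 1 episode each (placeholder)
    # Number of grid points needed: about log2(log2(K)).
    M = max(1, math.ceil(math.log2(max(2.0, math.log2(max(K, 2))))))
    for i in range(1, M + 1):                # exponent-doubling grid t_i = K^(1 - 2^-i)
        t = math.ceil(K ** (1.0 - 2.0 ** (-i)))
        if ends[-1] < t < K:
            ends.append(t)
    if ends[-1] < K:
        ends.append(K)                       # final batch runs to episode K
    return ends

if __name__ == "__main__":
    K, H = 1_000_000, 10
    sched = batch_schedule(K, H)
    print(f"{len(sched)} batches for K={K}, H={H}: {sched}")
```

For example, with $K=10^6$ and $H=10$ the sketch commits to 16 policy-update points in advance, on the order of the $O(H+\log_2\log_2 K)$ batch complexity claimed in the abstract.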
Related papers
- Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation [3.2703356989962518]
We study model-based reinforcement learning with non-linear function approximation.
We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings.
arXiv Detail & Related papers (2024-06-19T15:29:14Z) - Near-Optimal Reinforcement Learning with Self-Play under Adaptivity
Constraints [21.412085244757336]
We study the problem of multi-agent reinforcement learning (MARL) with adaptivity constraints.
For two-player zero-sum Markov Games, we design a (policy) elimination based algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^3 S^2 ABK})$.
We prove a batch complexity lower bound of $\Omega(\frac{H}{\log_A K}+\log\log K)$ for all algorithms with $\widetilde{O}(\sqrt{K})$ regret.
arXiv Detail & Related papers (2024-02-02T03:00:40Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost [31.04961854943877]
We propose a new algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of $O(HSA \log\log T)$.
As a byproduct, our new algorithmic techniques allow us to derive a reward-free exploration algorithm with an optimal switching cost of $O(HSA)$.
arXiv Detail & Related papers (2022-02-13T19:01:06Z) - Provably Efficient Reinforcement Learning with Linear Function
Approximation Under Adaptivity Constraints [94.76881135901753]
We consider two popular limited adaptivity models: the batch learning model and the rare policy switch model.
Our proposed LSVI-UCB-Batch algorithm achieves an $\tilde{O}(\sqrt{d^3H^3T} + dHT/B)$ regret.
For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $\tilde{O}(\sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret.
arXiv Detail & Related papers (2021-01-06T18:56:07Z) - Logarithmic Regret for Reinforcement Learning with Linear Function
Approximation [99.59319332864129]
We show that logarithmic regret is attainable under two recently proposed linear MDP assumptions.
To the best of our knowledge, these are the first logarithmic regret bounds for RL with linear function approximation.
arXiv Detail & Related papers (2020-11-23T17:25:00Z) - $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic $Q$-learning enjoys a $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}(H)}{\Delta_{\min}}\log(SAT)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z) - Agnostic Q-learning with Function Approximation in Deterministic
Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
We study the problem of agnostic $Q$-learning with function approximation in deterministic systems.
We show that if $\delta = O\left(\rho/\sqrt{\dim_E}\right)$, then one can find the optimal policy using $O\left(\dim_E\right)$.
arXiv Detail & Related papers (2020-02-17T18:41:49Z)