Double Thompson Sampling in Finite Stochastic Games
- URL: http://arxiv.org/abs/2202.10008v1
- Date: Mon, 21 Feb 2022 06:11:51 GMT
- Title: Double Thompson Sampling in Finite Stochastic Games
- Authors: Shuqing Shi, Xiaobin Wang, Zhiyou Yang, Fan Zhang and Hong Qu
- Abstract summary: We consider the trade-off between exploration and exploitation in a finite discounted Markov Decision Process.
We propose a double Thompson sampling reinforcement learning algorithm (DTS) to solve this kind of problem.
- Score: 10.559241770674724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the trade-off between exploration and exploitation in a
finite discounted Markov Decision Process, where the state transition matrix of
the underlying environment is unknown. We propose a double Thompson sampling
reinforcement learning algorithm (DTS) to solve this kind of problem. The
algorithm achieves a total regret bound of
$\tilde{\mathcal{O}}(D\sqrt{SAT})$\footnote{The symbol $\tilde{\mathcal{O}}$
means $\mathcal{O}$ with log factors ignored.} over a time horizon $T$ with $S$
states, $A$ actions, and diameter $D$. DTS consists of two parts: the first is
the standard part, in which we apply posterior sampling to the transition
matrix based on a prior distribution; in the second part, we employ a
count-based posterior update method to balance the locally optimal action
against the long-term optimal action in order to find the globally optimal game
value. We establish a regret bound of $\tilde{\mathcal{O}}(\sqrt{T}/S^{2})$,
which is, to our knowledge, by far the best regret bound for finite discounted
Markov Decision Processes. Numerical results demonstrate the efficiency and
superiority of our approach.
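To make the two-part description concrete, the following is a minimal, illustrative Python sketch of posterior sampling over the transition matrix of a finite discounted MDP, with a simplified count-based perturbation standing in for the paper's second sampling step. The `env.step(s, a)` interface, the Dirichlet prior, and the reward bookkeeping are assumptions made for illustration; this is not the authors' exact DTS procedure.

```python
import numpy as np

def dts_sketch(env, S, A, gamma=0.95, horizon=10_000, seed=0):
    """Illustrative posterior-sampling loop for a finite discounted MDP.

    Hypothetical interface: env.step(s, a) -> (next_state, reward).
    A simplified sketch, not the paper's exact DTS algorithm.
    """
    rng = np.random.default_rng(seed)
    counts = np.ones((S, A, S))   # Dirichlet(1, ..., 1) prior over next states
    rewards = np.zeros((S, A))    # running mean of observed rewards
    visits = np.zeros((S, A))     # visit counts per state-action pair
    s = 0
    for _ in range(horizon):
        # Part 1: sample a transition matrix from the Dirichlet posterior.
        P = np.array([[rng.dirichlet(counts[i, j]) for j in range(A)]
                      for i in range(S)])             # shape (S, A, S)
        # Solve the sampled MDP by value iteration on the discounted criterion.
        V = np.zeros(S)
        for _ in range(200):
            Q = rewards + gamma * (P @ V)             # shape (S, A)
            V = Q.max(axis=1)
        # Part 2 (simplified): a count-based perturbation of the greedy choice,
        # standing in for the paper's second, count-based posterior update.
        bonus = rng.normal(size=A) / np.sqrt(visits[s] + 1.0)
        a = int(np.argmax(Q[s] + bonus))
        # Act, then update the transition counts and empirical reward for (s, a).
        s_next, r = env.step(s, a)
        counts[s, a, s_next] += 1.0
        visits[s, a] += 1.0
        rewards[s, a] += (r - rewards[s, a]) / visits[s, a]
        s = s_next
    return counts, rewards
```

In this sketch, resampling a transition matrix each step and re-solving the sampled MDP is the standard posterior-sampling pattern, while the count-based bonus shrinks as the visit counts grow, so exploration fades on well-visited state-action pairs.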
Related papers
- Best-of-Both-Worlds Algorithms for Linear Contextual Bandits [11.94312915280916]
We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits.
Our algorithms deliver near-optimal regret bounds in both the stochastic and adversarial regimes.
arXiv Detail & Related papers (2023-12-24T08:27:30Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches.
We design a computationally efficient algorithm that achieves a near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Computationally Efficient Horizon-Free Reinforcement Learning for Linear
Mixture MDPs [111.75736569611159]
We propose the first computationally efficient horizon-free algorithm for linear mixture MDPs.
Our algorithm adapts a weighted least squares estimator for the unknown transition dynamics.
This also improves upon the best-known algorithms in this setting when the $\sigma_k^2$'s are known.
arXiv Detail & Related papers (2022-05-23T17:59:18Z) - On Unbalanced Optimal Transport: Gradient Methods, Sparsity and
Approximation Error [18.19398247972205]
We study the Unbalanced Optimal Transport (UOT) between two measures of possibly different masses with at most $n$ components.
We propose a novel algorithm based on the Gradient Extrapolation Method (GEM-UOT) to find an $\varepsilon$-approximate solution to the UOT problem.
arXiv Detail & Related papers (2022-02-08T03:22:39Z) - The best of both worlds: stochastic and adversarial episodic MDPs with
unknown transition [49.78053380710322]
We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes.
Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known.
In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader (FTRL) framework together with a set of new techniques.
arXiv Detail & Related papers (2021-06-08T05:46:35Z) - Provably Breaking the Quadratic Error Compounding Barrier in Imitation
Learning, Optimally [58.463668865380946]
We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$.
We establish an upper bound of $O(|\mathcal{S}|H^{3/2}/N)$ on the suboptimality using the Mimic-MD algorithm of Rajaraman et al. (2020).
We show the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}| \geq 3$, while the unknown-transition setting suffers from a larger sharp rate.
arXiv Detail & Related papers (2021-02-25T15:50:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.