Online Markov Decision Processes with Non-oblivious Strategic Adversary
- URL: http://arxiv.org/abs/2110.03604v2
- Date: Fri, 8 Oct 2021 09:01:06 GMT
- Title: Online Markov Decision Processes with Non-oblivious Strategic Adversary
- Authors: Le Cong Dinh, David Henry Mguni, Long Tran-Thanh, Jun Wang, Yaodong
Yang
- Abstract summary: We study a novel setting in Online Markov Decision Processes (OMDPs) where the loss function is chosen by a non-oblivious strategic adversary.
We first demonstrate that MDP-Expert, an existing algorithm that works well with oblivious adversaries, can still apply and achieve a policy regret bound of $\mathcal{O}(\sqrt{T\log(L)}+\tau^2\sqrt{T\log(|A|)})$, where $L$ is the size of the adversary's pure strategy set and $|A|$ denotes the size of the agent's action space.
- Score: 35.25621843666453
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a novel setting in Online Markov Decision Processes (OMDPs) where
the loss function is chosen by a non-oblivious strategic adversary who follows
a no-external regret algorithm. In this setting, we first demonstrate that
MDP-Expert, an existing algorithm that works well with oblivious adversaries,
can still apply and achieve a policy regret bound of $\mathcal{O}(\sqrt{T
\log(L)}+\tau^2\sqrt{T \log(|A|)})$, where $L$ is the size of the adversary's pure
strategy set and $|A|$ denotes the size of the agent's action space. Considering
real-world games where the support size of a NE is small, we further propose a
new algorithm: MDP-Online Oracle Expert (MDP-OOE), that achieves a policy
regret bound of $\mathcal{O}(\sqrt{T\log(L)}+\tau^2\sqrt{ T k \log(k)})$ where
$k$ depends only on the support size of the NE. MDP-OOE leverages the key
benefit of Double Oracle in game theory and thus can solve games with
prohibitively large action space. Finally, to better understand the learning
dynamics of no-regret methods, under the same setting of a no-external-regret
adversary in OMDPs, we introduce an algorithm that achieves a last-round
convergence result to a NE. To the best of our knowledge, this is the first work
leading to a last-iteration convergence result in OMDPs.
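  As a rough illustration of the setting, the sketch below pits two Hedge (multiplicative-weights) learners against each other in a small zero-sum matrix game: the agent updates its mixed strategy against losses induced by an adversary that itself runs a no-external-regret update on payoffs that depend on the agent's current play. This is only a minimal stand-in for a single-state OMDP under assumed parameters (random payoff matrix, step size, horizon); it is not the paper's MDP-Expert or MDP-OOE algorithm.

```python
import numpy as np

def hedge_update(weights, losses, eta):
    """One multiplicative-weights (Hedge) step: exponentially down-weight costly actions."""
    weights = weights * np.exp(-eta * losses)
    return weights / weights.sum()

# Illustrative zero-sum matrix game: the agent picks rows, the adversary picks columns.
rng = np.random.default_rng(0)
A = rng.uniform(size=(5, 4))                # hypothetical loss matrix for the agent
T = 10_000
eta = np.sqrt(np.log(A.shape[0]) / T)       # standard Hedge step size, O(sqrt(log n / T))

x = np.full(A.shape[0], 1.0 / A.shape[0])   # agent's mixed strategy
y = np.full(A.shape[1], 1.0 / A.shape[1])   # adversary's mixed strategy
x_avg = np.zeros_like(x)

for _ in range(T):
    agent_loss = A @ y          # expected loss of each agent action under the adversary's play
    adversary_loss = -(x @ A)   # non-oblivious: the adversary's losses depend on the agent's x
    x = hedge_update(x, agent_loss, eta)
    y = hedge_update(y, adversary_loss, eta)
    x_avg += x / T

print("time-averaged agent strategy:", np.round(x_avg, 3))
```

  In zero-sum games the time-averaged strategies of two no-external-regret learners approach a NE, while the individual iterates may keep cycling; this is what makes a last-round convergence guarantee, as studied in the paper, non-trivial.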
Related papers
- Learning Adversarial Low-rank Markov Decision Processes with Unknown
Transition and Full-information Feedback [30.23951525723659]
We study the low-rank MDPs with adversarially changed losses in the full-information feedback setting.
We propose a policy optimization-based algorithm POLO, and we prove that it attains the $\widetilde{O}(K^{\frac{5}{6}} A^{\frac{1}{2}} d \ln(1+M)/(1-\gamma)^2)$ regret guarantee.
arXiv Detail & Related papers (2023-11-14T03:12:43Z) - A Doubly Robust Approach to Sparse Reinforcement Learning [19.68978899041642]
We propose a new regret-minimization algorithm for the episodic sparse linear Markov decision process (SMDP).
The regret of the proposed algorithm is $\tilde{O}(\sigma_{\min}^{-1} s_{\star} H \sqrt{N})$, where $\sigma_{\min}$ denotes the restricted minimum eigenvalue of the average Gram matrix of feature vectors.
arXiv Detail & Related papers (2023-10-23T18:52:17Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Eluder-based Regret for Stochastic Contextual MDPs [43.19667415823089]
We present the E-UC$^3$RL algorithm for regret minimization in Contextual Markov Decision Processes (CMDPs).
Our algorithm is efficient (assuming efficient offline regression oracles) and enjoys a regret guarantee of $\widetilde{O}(H^3 \sqrt{T |S| |A| d_{\mathrm{E}}(\mathcal{P})})$.
arXiv Detail & Related papers (2022-11-27T20:38:47Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Near-Optimal No-Regret Learning for General Convex Games [121.50979258049135]
We show that near-optimal regret can be obtained for general convex and compact strategy sets.
Our dynamics are based on an instantiation of optimistic follow-the-regularized-leader over an appropriately lifted space.
Even in those special cases where prior results apply, our algorithm improves over the state-of-the-art regret.
arXiv Detail & Related papers (2022-06-17T12:58:58Z) - A Provably Efficient Algorithm for Linear Markov Decision Process with
Low Switching Cost [53.968049198926444]
We present the first algorithm for linear MDP with a low switching cost.
Our algorithm achieves an $\widetilde{O}\left(\sqrt{d^3 H^4 K}\right)$ regret bound with a near-optimal $O\left(d H \log K\right)$ global switching cost.
arXiv Detail & Related papers (2021-01-02T18:41:27Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with
Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs).
We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z)