Simultaneously Learning Stochastic and Adversarial Episodic MDPs with
Known Transition
- URL: http://arxiv.org/abs/2006.05606v2
- Date: Mon, 2 Nov 2020 07:21:00 GMT
- Title: Simultaneously Learning Stochastic and Adversarial Episodic MDPs with
Known Transition
- Authors: Tiancheng Jin, Haipeng Luo
- Abstract summary: We develop the first algorithm with a ``best-of-both-worlds'' guarantee.
It achieves $\mathcal{O}(\log T)$ regret when the losses are stochastic, and $\tilde{\mathcal{O}}(\sqrt{T})$ regret when they are adversarial.
More generally, it achieves $\tilde{\mathcal{O}}(\sqrt{C})$ regret in an intermediate setting where the losses are corrupted by a total amount of $C$.
- Score: 38.28925339231888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies the problem of learning episodic Markov Decision Processes
with known transition and bandit feedback. We develop the first algorithm with
a ``best-of-both-worlds'' guarantee: it achieves $\mathcal{O}(\log T)$ regret
when the losses are stochastic, and simultaneously enjoys worst-case robustness
with $\tilde{\mathcal{O}}(\sqrt{T})$ regret even when the losses are
adversarial, where $T$ is the number of episodes. More generally, it achieves
$\tilde{\mathcal{O}}(\sqrt{C})$ regret in an intermediate setting where the
losses are corrupted by a total amount of $C$. Our algorithm is based on the
Follow-the-Regularized-Leader method from Zimin and Neu (2013), with a novel
hybrid regularizer inspired by recent works of Zimmert et al. (2019a, 2019b)
for the special case of multi-armed bandits. Crucially, our regularizer admits
a non-diagonal Hessian with a highly complicated inverse. Analyzing such a
regularizer and deriving a particular self-bounding regret guarantee is our key
technical contribution and might be of independent interest.
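A minimal illustration of the FTRL ingredient mentioned above, restricted to the multi-armed-bandit special case: FTRL over the probability simplex with the $1/2$-Tsallis-entropy regularizer (in the style of Zimmert et al.), importance-weighted loss estimates, and a $1/\sqrt{t}$ learning rate. This is a hedged sketch under simplifying assumptions, not the paper's algorithm: the paper runs FTRL over occupancy measures with a hybrid regularizer whose non-diagonal Hessian is the central technical difficulty, and none of that machinery is reproduced here. Function names and the learning-rate schedule are illustrative choices.

import numpy as np

def tsallis_ftrl_probs(cum_loss_est, eta, iters=100, tol=1e-10):
    """One FTRL step over the simplex with the 1/2-Tsallis-entropy regularizer:
        p = argmin_{p in simplex} <p, L_hat> - (1/eta) * sum_i sqrt(p_i).
    The first-order conditions give p_i = 1 / (2 * eta * (L_hat_i + lam))^2,
    where the multiplier lam is found by binary search so that sum(p) = 1."""
    K = len(cum_loss_est)
    lo = -cum_loss_est.min() + 1e-12                   # keeps every L_hat_i + lam > 0
    hi = -cum_loss_est.min() + np.sqrt(K) / (2 * eta)  # at this lam, sum(p) <= 1 already
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        p = 1.0 / (2 * eta * (cum_loss_est + lam)) ** 2
        if abs(p.sum() - 1.0) < tol:
            break
        if p.sum() > 1.0:
            lo = lam   # probabilities too large -> increase the multiplier
        else:
            hi = lam
    return p / p.sum()

def run_bandit(losses, seed=0):
    """Plays a K-armed bandit against a (T, K) array of losses in [0, 1],
    choosing arms by Tsallis-entropy FTRL on importance-weighted estimates."""
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    cum_loss_est = np.zeros(K)
    total_loss = 0.0
    for t in range(1, T + 1):
        eta = 1.0 / np.sqrt(t)                # illustrative time-varying learning rate
        p = tsallis_ftrl_probs(cum_loss_est, eta)
        arm = rng.choice(K, p=p)
        loss = losses[t - 1, arm]
        total_loss += loss
        cum_loss_est[arm] += loss / p[arm]    # unbiased importance-weighted estimate
    return total_loss

The point of this special case is the self-bounding flavor of analysis highlighted in the abstract: in the stochastic regime the FTRL regret bound can itself be bounded by a fraction of the regret plus an $O(\log T)$ term, which yields the logarithmic rate; the paper's contribution is carrying out such an argument for the far more complicated hybrid regularizer over occupancy measures.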
Related papers
- No-Regret Online Reinforcement Learning with Adversarial Losses and
Transitions [39.864174754882754]
Existing online learning algorithms for adversarial Markov Decision Processes achieve $O(\sqrt{T})$ regret after $T$ rounds of interactions, with the caveat that the transition function has to be fixed.
This is because it has been shown that adversarial transition functions make no-regret learning impossible.
We propose an algorithm that enjoys $\widetilde{O}(\sqrt{T} + C^{\textsf{P}})$ regret, where $C^{\textsf{P}}$ measures how adversarial the transition functions are.
arXiv Detail & Related papers (2023-05-27T06:10:17Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Weighted Tallying Bandits: Overcoming Intractability via Repeated
Exposure Optimality [46.445321251991096]
In recommender system or crowdsourcing applications, a human's preferences or abilities are often a function of the algorithm's recent actions.
We introduce the Weighted Tallying Bandit (WTB), which generalizes this setting by requiring that an action's loss is a function of the summation of the number of times that arm was played in the last $m$ timesteps.
We show that for problems with $K$ actions and horizon $T$, a simple modification of the successive elimination algorithm has $O\left(\sqrt{KT} + m + \cdots\right)$ regret (a sketch of the unmodified successive elimination baseline appears after this list).
arXiv Detail & Related papers (2023-05-04T15:59:43Z) - On Submodular Contextual Bandits [92.45432756301231]
We consider the problem of contextual bandits where actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function.
We show that our algorithm efficiently randomizes around local optima of estimated functions according to the Inverse Gap Weighting strategy (a generic sketch of this weighting rule appears after this list).
arXiv Detail & Related papers (2021-12-03T21:42:33Z) - The best of both worlds: stochastic and adversarial episodic MDPs with
unknown transition [49.78053380710322]
We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes.
Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known.
In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader (FTRL) framework together with a set of new techniques.
arXiv Detail & Related papers (2021-06-08T05:46:35Z) - Finding the Stochastic Shortest Path with Low Regret: The Adversarial
Cost and Unknown Transition Case [29.99619764839178]
We make significant progress toward the shortest path problem with adversarial costs and unknown transition.
Specifically, we develop algorithms that achieve $\widetilde{O}(\sqrt{S^3 A^2 D T_\star K})$ regret for the full-information setting.
We are also the first to consider the most challenging combination: bandit feedback with adversarial costs and unknown transition.
arXiv Detail & Related papers (2021-02-10T06:33:04Z) - Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z) - Minimax Regret for Stochastic Shortest Path with Adversarial Costs and
Known Transition [37.6975819766632]
We study the shortest path problem with adversarial costs and known transition.
We show that the minimax regret is $\widetilde{O}(\sqrt{D T_\star K})$ and $\widetilde{O}(\sqrt{D T_\star S A K})$ for the full-information setting and the bandit feedback setting, respectively.
arXiv Detail & Related papers (2020-12-07T20:55:28Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov Decision Processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
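The weighted tallying bandit entry above reports a bound for a modification of successive elimination. For orientation only, here is a sketch of the unmodified successive elimination baseline for an ordinary $K$-armed stochastic bandit with losses; the WTB-specific changes needed to handle the $m$-step memory are not reproduced, and the confidence radius, delta, and the pull interface are illustrative assumptions.

import numpy as np

def successive_elimination(pull, K, T, delta=0.05):
    """Plain successive elimination (losses in [0, 1], smaller is better).
    `pull(arm)` returns one stochastic loss sample for that arm."""
    active = list(range(K))
    counts = np.zeros(K)
    means = np.zeros(K)
    t = 0
    while t < T and len(active) > 1:
        for arm in list(active):               # play every surviving arm once per round
            if t >= T:
                break
            loss = pull(arm)
            counts[arm] += 1
            means[arm] += (loss - means[arm]) / counts[arm]
            t += 1
        # Hoeffding-style confidence radius with a crude union bound.
        rad = np.sqrt(np.log(2 * K * max(t, 1) / delta) / (2 * np.maximum(counts, 1)))
        best_ucb = min(means[a] + rad[a] for a in active)
        # Drop every arm whose lower confidence bound already exceeds the best upper bound.
        active = [a for a in active if means[a] - rad[a] <= best_ucb]
    # A regret-minimizing variant keeps pulling the surviving arm for the remaining rounds.
    return active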
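The submodular contextual bandits entry cites the Inverse Gap Weighting strategy. For reference, this is a sketch of the generic inverse gap weighting distribution over a finite action set (in the Abe-Long / Foster-Rakhlin form): actions with a larger estimated gap to the greedy action receive proportionally less probability, and the leftover mass goes to the greedy action. The paper applies this kind of randomization around local optima of estimated submodular functions, which is not reproduced here; the function name and the numbers in the example are illustrative.

import numpy as np

def inverse_gap_weighting(est_rewards, gamma):
    """Generic IGW distribution: p(a) = 1 / (K + gamma * gap(a)) for non-greedy a,
    with the remaining probability mass placed on the greedy action."""
    est_rewards = np.asarray(est_rewards, dtype=float)
    K = len(est_rewards)
    best = int(np.argmax(est_rewards))
    gaps = est_rewards[best] - est_rewards     # gap(a) >= 0, zero for the greedy action
    probs = 1.0 / (K + gamma * gaps)
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()            # always >= 1/K, so this is a valid distribution
    return probs

# Example: four actions; a larger gamma concentrates more mass on the greedy action.
print(inverse_gap_weighting([0.9, 0.5, 0.4, 0.1], gamma=20.0))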