Simultaneously Learning Stochastic and Adversarial Episodic MDPs with
Known Transition
- URL: http://arxiv.org/abs/2006.05606v2
- Date: Mon, 2 Nov 2020 07:21:00 GMT
- Title: Simultaneously Learning Stochastic and Adversarial Episodic MDPs with
Known Transition
- Authors: Tiancheng Jin, Haipeng Luo
- Abstract summary: We develop the first algorithm with a ``best-of-both-worlds'' guarantee.
It achieves $\mathcal{O}(\log T)$ regret when the losses are stochastic, and $\tilde{\mathcal{O}}(\sqrt{T})$ regret when they are adversarial.
More generally, it achieves $\tilde{\mathcal{O}}(\sqrt{C})$ regret in an intermediate setting where the losses are corrupted by a total amount of $C$.
- Score: 38.28925339231888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies the problem of learning episodic Markov Decision Processes
with known transition and bandit feedback. We develop the first algorithm with
a ``best-of-both-worlds'' guarantee: it achieves $\mathcal{O}(\log T)$ regret
when the losses are stochastic, and simultaneously enjoys worst-case robustness
with $\tilde{\mathcal{O}}(\sqrt{T})$ regret even when the losses are
adversarial, where $T$ is the number of episodes. More generally, it achieves
$\tilde{\mathcal{O}}(\sqrt{C})$ regret in an intermediate setting where the
losses are corrupted by a total amount of $C$. Our algorithm is based on the
Follow-the-Regularized-Leader method from Zimin and Neu (2013), with a novel
hybrid regularizer inspired by recent works of Zimmert et al. (2019a, 2019b)
for the special case of multi-armed bandits. Crucially, our regularizer admits
a non-diagonal Hessian with a highly complicated inverse. Analyzing such a
regularizer and deriving a particular self-bounding regret guarantee is our key
technical contribution and might be of independent interest.
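A minimal illustration of the FTRL ingredient mentioned above, restricted to the multi-armed-bandit special case: FTRL over the probability simplex with the $1/2$-Tsallis-entropy regularizer (in the style of Zimmert et al.), importance-weighted loss estimates, and a $1/\sqrt{t}$ learning rate. This is a hedged sketch under simplifying assumptions, not the paper's algorithm: the paper runs FTRL over occupancy measures with a hybrid regularizer whose non-diagonal Hessian is the central technical difficulty, and none of that machinery is reproduced here. Function names and the learning-rate schedule are illustrative choices.

import numpy as np

def tsallis_ftrl_probs(cum_loss_est, eta, iters=100, tol=1e-10):
    """One FTRL step over the simplex with the 1/2-Tsallis-entropy regularizer:
        p = argmin_{p in simplex} <p, L_hat> - (1/eta) * sum_i sqrt(p_i).
    The first-order conditions give p_i = 1 / (2 * eta * (L_hat_i + lam))^2,
    where the multiplier lam is found by binary search so that sum(p) = 1."""
    K = len(cum_loss_est)
    lo = -cum_loss_est.min() + 1e-12                   # keeps every L_hat_i + lam > 0
    hi = -cum_loss_est.min() + np.sqrt(K) / (2 * eta)  # at this lam, sum(p) <= 1 already
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        p = 1.0 / (2 * eta * (cum_loss_est + lam)) ** 2
        if abs(p.sum() - 1.0) < tol:
            break
        if p.sum() > 1.0:
            lo = lam   # probabilities too large -> increase the multiplier
        else:
            hi = lam
    return p / p.sum()

def run_bandit(losses, seed=0):
    """Plays a K-armed bandit against a (T, K) array of losses in [0, 1],
    choosing arms by Tsallis-entropy FTRL on importance-weighted estimates."""
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    cum_loss_est = np.zeros(K)
    total_loss = 0.0
    for t in range(1, T + 1):
        eta = 1.0 / np.sqrt(t)                # illustrative time-varying learning rate
        p = tsallis_ftrl_probs(cum_loss_est, eta)
        arm = rng.choice(K, p=p)
        loss = losses[t - 1, arm]
        total_loss += loss
        cum_loss_est[arm] += loss / p[arm]    # unbiased importance-weighted estimate
    return total_loss

The point of this special case is the self-bounding flavor of analysis highlighted in the abstract: in the stochastic regime the FTRL regret bound can itself be bounded by a fraction of the regret plus an $O(\log T)$ term, which yields the logarithmic rate; the paper's contribution is carrying out such an argument for the far more complicated hybrid regularizer over occupancy measures.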
Related papers
- No-Regret Online Reinforcement Learning with Adversarial Losses and
Transitions [39.864174754882754]
Existing online learning algorithms for adversarial Markov Decision Processes achieve $O(\sqrt{T})$ regret after $T$ rounds of interactions, with the caveat that the transition function has to be fixed.
This is because it has been shown that adversarial transition functions make no-regret learning impossible.
We propose an algorithm that enjoys $\widetilde{O}(\sqrt{T} + C^{\textsf{P}})$ regret, where $C^{\textsf{P}}$ measures how adversarial the transition functions are.
arXiv Detail & Related papers (2023-05-27T06:10:17Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Weighted Tallying Bandits: Overcoming Intractability via Repeated
Exposure Optimality [46.445321251991096]
In recommender system or crowdsourcing applications, a human's preferences or abilities are often a function of the algorithm's recent actions.
We introduce the Weighted Tallying Bandit (WTB), which generalizes this setting by requiring that an action's loss is a function of the summation of the number of times that arm was played in the last $m$ timesteps.
We show that for problems with $K$ actions and horizon $T$, a simple modification of the successive elimination algorithm has $O\left(\sqrt{KT} + m + \cdots\right)$ regret (a sketch of the unmodified successive elimination baseline appears after this list).
arXiv Detail & Related papers (2023-05-04T15:59:43Z) - On Submodular Contextual Bandits [92.45432756301231]
We consider the problem of contextual bandits where actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function.
We show that our algorithm efficiently randomizes around local optima of estimated functions according to the Inverse Gap Weighting strategy (a generic sketch of this weighting rule appears after this list).
arXiv Detail & Related papers (2021-12-03T21:42:33Z) - The best of both worlds: stochastic and adversarial episodic MDPs with
unknown transition [49.78053380710322]
We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes.
Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known.
In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader (FTRL) framework together with a set of new techniques.
arXiv Detail & Related papers (2021-06-08T05:46:35Z) - Finding the Stochastic Shortest Path with Low Regret: The Adversarial
Cost and Unknown Transition Case [29.99619764839178]
We make significant progress toward the shortest path problem with adversarial costs and unknown transition.
Specifically, we develop algorithms that achieve $\widetilde{O}(\sqrt{S^3 A^2 D T_\star K})$ regret for the full-information setting.
We are also the first to consider the most challenging combination: bandit feedback with adversarial costs and unknown transition.
arXiv Detail & Related papers (2021-02-10T06:33:04Z) - Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z) - Minimax Regret for Stochastic Shortest Path with Adversarial Costs and
Known Transition [37.6975819766632]
We study the shortest path problem with adversarial costs and known transition.
We show that the minimax regret is $\widetilde{O}(\sqrt{D T_\star K})$ and $\widetilde{O}(\sqrt{D T_\star S A K})$ for the full-information setting and the bandit feedback setting, respectively.
arXiv Detail & Related papers (2020-12-07T20:55:28Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov Decision Processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
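The weighted tallying bandit entry above reports a bound for a modification of successive elimination. For orientation only, here is a sketch of the unmodified successive elimination baseline for an ordinary $K$-armed stochastic bandit with losses; the WTB-specific changes needed to handle the $m$-step memory are not reproduced, and the confidence radius, delta, and the pull interface are illustrative assumptions.

import numpy as np

def successive_elimination(pull, K, T, delta=0.05):
    """Plain successive elimination (losses in [0, 1], smaller is better).
    `pull(arm)` returns one stochastic loss sample for that arm."""
    active = list(range(K))
    counts = np.zeros(K)
    means = np.zeros(K)
    t = 0
    while t < T and len(active) > 1:
        for arm in list(active):               # play every surviving arm once per round
            if t >= T:
                break
            loss = pull(arm)
            counts[arm] += 1
            means[arm] += (loss - means[arm]) / counts[arm]
            t += 1
        # Hoeffding-style confidence radius with a crude union bound.
        rad = np.sqrt(np.log(2 * K * max(t, 1) / delta) / (2 * np.maximum(counts, 1)))
        best_ucb = min(means[a] + rad[a] for a in active)
        # Drop every arm whose lower confidence bound already exceeds the best upper bound.
        active = [a for a in active if means[a] - rad[a] <= best_ucb]
    # A regret-minimizing variant keeps pulling the surviving arm for the remaining rounds.
    return active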
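The submodular contextual bandits entry cites the Inverse Gap Weighting strategy. For reference, this is a sketch of the generic inverse gap weighting distribution over a finite action set (in the Abe-Long / Foster-Rakhlin form): actions with a larger estimated gap to the greedy action receive proportionally less probability, and the leftover mass goes to the greedy action. The paper applies this kind of randomization around local optima of estimated submodular functions, which is not reproduced here; the function name and the numbers in the example are illustrative.

import numpy as np

def inverse_gap_weighting(est_rewards, gamma):
    """Generic IGW distribution: p(a) = 1 / (K + gamma * gap(a)) for non-greedy a,
    with the remaining probability mass placed on the greedy action."""
    est_rewards = np.asarray(est_rewards, dtype=float)
    K = len(est_rewards)
    best = int(np.argmax(est_rewards))
    gaps = est_rewards[best] - est_rewards     # gap(a) >= 0, zero for the greedy action
    probs = 1.0 / (K + gamma * gaps)
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()            # always >= 1/K, so this is a valid distribution
    return probs

# Example: four actions; a larger gamma concentrates more mass on the greedy action.
print(inverse_gap_weighting([0.9, 0.5, 0.4, 0.1], gamma=20.0))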