No-Regret Online Reinforcement Learning with Adversarial Losses and
Transitions
- URL: http://arxiv.org/abs/2305.17380v3
- Date: Thu, 26 Oct 2023 17:12:30 GMT
- Title: No-Regret Online Reinforcement Learning with Adversarial Losses and
Transitions
- Authors: Tiancheng Jin, Junyan Liu, Chloé Rouyer, William Chang, Chen-Yu Wei,
Haipeng Luo
- Abstract summary: Existing online learning algorithms for adversarial Markov Decision Processes achieve $O(\sqrt{T})$ regret after $T$ rounds of interactions.
This is because it has been shown that adversarial transition functions make no-regret learning impossible.
We propose an algorithm that enjoys $\widetilde{O}(\sqrt{T} + C^{\textsf{P}})$ regret, where $C^{\textsf{P}}$ measures how adversarial the transition functions are.
- Score: 39.864174754882754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing online learning algorithms for adversarial Markov Decision Processes
achieve ${O}(\sqrt{T})$ regret after $T$ rounds of interactions even if the
loss functions are chosen arbitrarily by an adversary, with the caveat that the
transition function has to be fixed. This is because it has been shown that
adversarial transition functions make no-regret learning impossible. Despite
such impossibility results, in this work, we develop algorithms that can handle
both adversarial losses and adversarial transitions, with regret increasing
smoothly in the degree of maliciousness of the adversary. More concretely, we
first propose an algorithm that enjoys $\widetilde{{O}}(\sqrt{T} +
C^{\textsf{P}})$ regret where $C^{\textsf{P}}$ measures how adversarial the
transition functions are and can be at most ${O}(T)$. While this algorithm
itself requires knowledge of $C^{\textsf{P}}$, we further develop a black-box
reduction approach that removes this requirement. Moreover, we also show that
further refinements of the algorithm not only maintain the same regret bound,
but also simultaneously adapt to easier environments (where losses are
generated in a certain stochastically constrained manner as in Jin et al.
[2021]) and achieve $\widetilde{{O}}(U + \sqrt{UC^{\textsf{L}}} +
C^{\textsf{P}})$ regret, where $U$ is some standard gap-dependent coefficient
and $C^{\textsf{L}}$ is the amount of corruption on losses.
Related papers
- Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback [30.337826346496385]
We study online reinforcement learning in linear Markov decision processes with adversarial losses and bandit feedback.
We introduce two algorithms that achieve improved regret performance compared to existing approaches.
arXiv Detail & Related papers (2023-10-17T19:43:37Z)
- Online Learning in Dynamically Changing Environments [11.731001328350983]
We study the problem of online learning and online regret when samples are drawn from a general unknown non-stationary process.
We prove a tight (up to a $\sqrt{\log T}$ factor) bound of $O(\sqrt{KT\cdot\mathsf{VC}(\mathcal{H})\log T})$ on the expected worst-case regret of any finite VC-dimensional class.
We extend these results to general smooth adversary processes with unknown reference measure.
arXiv Detail & Related papers (2023-01-31T21:10:03Z)
- Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z)
- Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes [59.61248760134937]
We propose an efficient algorithm that achieves a regret of $\tilde{O}(\sqrt{T}+\zeta)$.
The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandits.
We generalize our algorithm to the episodic MDP setting and are the first to achieve an additive dependence on the corruption level $\zeta$.
arXiv Detail & Related papers (2022-12-12T15:04:56Z)
- Logarithmic Regret from Sublinear Hints [76.87432703516942]
We show that an algorithm can obtain $O(\log T)$ regret with just $O(\sqrt{T})$ hints under a natural query model.
We also show that $o(\sqrt{T})$ hints cannot guarantee better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2021-11-09T16:50:18Z)
- The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition [49.78053380710322]
We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes.
Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known.
In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader ($\text{FTRL}$) framework together with a set of new techniques.
arXiv Detail & Related papers (2021-06-08T05:46:35Z)
- Online Convex Optimization with Continuous Switching Constraint [78.25064451417082]
We introduce the problem of online convex optimization with continuous switching constraint.
We show that, for strongly convex functions, the regret bound can be improved to $O(\log T)$ for $S=\Omega(\log T)$, and to $O(\min\{T/\exp(S)+S,\,T\})$ for $S=O(\log T)$.
arXiv Detail & Related papers (2021-03-21T11:43:35Z)
- Projection-free Distributed Online Learning with Strongly Convex Losses [37.08975118221237]
We exploit the strong convexity of loss functions to improve the regret bound and communication complexity.
Our algorithm attains an $O(T^{2/3}\log T)$ regret bound, which is nearly optimal up to polylogarithmic factors.
arXiv Detail & Related papers (2021-03-20T05:38:51Z)
- Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition [38.28925339231888]
We develop the first algorithm with a "best-of-both-worlds" guarantee.
It achieves $\mathcal{O}(\log T)$ regret when the losses are stochastic.
More generally, it achieves $\tilde{\mathcal{O}}(\sqrt{C})$ regret in an intermediate setting.
arXiv Detail & Related papers (2020-06-10T01:59:34Z)