Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback
- URL: http://arxiv.org/abs/2201.13172v1
- Date: Mon, 31 Jan 2022 12:34:26 GMT
- Title: Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback
- Authors: Tiancheng Jin and Tal Lancewicki and Haipeng Luo and Yishay Mansour
and Aviv Rosenberg
- Abstract summary: We study online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback.
We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The standard assumption in reinforcement learning (RL) is that agents observe
feedback for their actions immediately. However, in practice feedback is often
observed with a delay. This paper studies online learning in episodic Markov
decision processes (MDPs) with unknown transitions, adversarially changing costs,
and unrestricted delayed bandit feedback. More precisely, the feedback for the
agent in episode $k$ is revealed only at the end of episode $k + d^k$, where
the delay $d^k$ can change across episodes and is chosen by an oblivious
adversary. We present the first algorithms that achieve near-optimal $\sqrt{K +
D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is
the total delay, significantly improving upon the best known regret bound of
$(K + D)^{2/3}$.
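As a concrete illustration of the feedback protocol described in the abstract, the following minimal Python sketch simulates delays $d^k$ fixed in advance (obliviously) and reveals the feedback for episode $k$ only at the end of episode $k + d^k$, while accumulating the total delay $D = \sum_k d^k$. The episode costs, delay range, and helper names (`arrival`, `pending_feedback`) are illustrative assumptions, not the paper's algorithm.

```python
import random
from collections import defaultdict

K = 20                                              # number of episodes
delays = [random.randint(0, 5) for _ in range(K)]   # d^1, ..., d^K, chosen obliviously
D = sum(delays)                                     # total delay D = sum_k d^k

# arrival[t] lists the episodes whose feedback is revealed at the end of episode t.
arrival = defaultdict(list)
for k, d_k in enumerate(delays):
    arrival[k + d_k].append(k)

pending_feedback = {}                               # episode index -> trajectory cost

for k in range(K):
    # The learner plays episode k without seeing its feedback yet; we store a
    # placeholder cost standing in for the bandit feedback of that trajectory.
    pending_feedback[k] = random.random()

    # At the end of episode k, only feedback whose delay has elapsed arrives.
    for j in arrival[k]:
        cost_j = pending_feedback.pop(j)
        # A real algorithm would update its policy with this delayed bandit
        # feedback here; the paper shows the regret scales as sqrt(K + D).

print(f"K = {K}, total delay D = {D}")
```

Note that feedback for episodes with $k + d^k > K$ never arrives within the horizon, which is consistent with the unrestricted-delay setting.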
Related papers
- Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback [49.84060509296641]
We study online finite-horizon Markov Decision Processes with adversarially changing losses and aggregate bandit feedback (a.k.a. full-bandit feedback).
Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory.
We introduce the first Policy Optimization algorithms for this setting.
arXiv Detail & Related papers (2025-02-06T12:03:24Z)
- Reinforcement Learning with Segment Feedback [56.54271464134885]
We consider a model named RL with segment feedback, which offers a general paradigm filling the gap between per-state-action feedback and trajectory feedback.
Under binary feedback, increasing the number of segments $m$ decreases the regret at an exponential rate; in contrast, surprisingly, under sum feedback, increasing $m$ does not reduce the regret significantly.
arXiv Detail & Related papers (2025-02-03T23:08:42Z)
- Delay as Payoff in MAB [40.65658965074464]
We investigate a variant of the classical Multi-armed Bandit (MAB) problem, where the payoff received by an agent is both delayed, and directly corresponds to the magnitude of the delay.
Our main contributions are tight upper and lower bounds for both the cost and reward settings.
Our regret bounds highlight the difference between the cost and reward scenarios, showing that the improvement in the cost scenario is more significant than in the reward scenario.
arXiv Detail & Related papers (2024-08-27T15:52:52Z)
- Adversarial Bandits with Multi-User Delayed Feedback: Theory and Application [17.64363983613468]
In this paper, we consider delayed feedback that comes from multiple users and is unrestricted in its internal distribution.
We formulate an adversarial MAB problem with multi-user delayed feedback and design a modified EXP3 algorithm, MUD-EXP3.
arXiv Detail & Related papers (2023-10-17T12:08:15Z)
- The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition [49.78053380710322]
We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes.
Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known.
In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader ($\text{FTRL}$) framework together with a set of new techniques.
arXiv Detail & Related papers (2021-06-08T05:46:35Z)
- Learning Adversarial Markov Decision Processes with Delayed Feedback [45.86354980347581]
We consider online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback.
We present novel algorithms that achieve near-optimal high-probability regret of $\widetilde{O}(\sqrt{K} + \sqrt{D})$ under full-information feedback.
arXiv Detail & Related papers (2020-12-29T16:47:42Z)
- Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition [37.6975819766632]
We study the shortest path problem with adversarial costs and known transition.
We show that the minimax regret is $\widetilde{O}(\sqrt{D T_\star K})$ and $\widetilde{O}(\sqrt{D T_\star S A K})$ for the full-information setting and the bandit feedback setting, respectively.
arXiv Detail & Related papers (2020-12-07T20:55:28Z)
- Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state with minimum total expected cost.
We show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S| |A| K})$ regret in the worst case.
arXiv Detail & Related papers (2020-02-23T09:10:14Z)