Learning Adversarial Markov Decision Processes with Delayed Feedback
- URL: http://arxiv.org/abs/2012.14843v2
- Date: Fri, 29 Jan 2021 13:10:46 GMT
- Title: Learning Adversarial Markov Decision Processes with Delayed Feedback
- Authors: Tal Lancewicki and Aviv Rosenberg and Yishay Mansour
- Abstract summary: We consider online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback.
We present novel algorithms that achieve near-optimal high-probability regret of $\widetilde O ( \sqrt{K} + \sqrt{D} )$ under full-information feedback.
- Score: 45.86354980347581
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning typically assumes that the agent observes feedback
from the environment immediately, but in many real-world applications (like
recommendation systems) the feedback is observed in delay. Thus, we consider
online learning in episodic Markov decision processes (MDPs) with unknown
transitions, adversarially changing costs and unrestricted delayed feedback.
That is, the costs and trajectory of episode $k$ are only available at the end
of episode $k + d^k$, where the delays $d^k$ are neither identical nor bounded,
and are chosen by an adversary. We present novel algorithms based on policy
optimization that achieve near-optimal high-probability regret of $\widetilde O
( \sqrt{K} + \sqrt{D} )$ under full-information feedback, where $K$ is the
number of episodes and $D = \sum_{k} d^k$ is the total delay. Under bandit
feedback, we prove similar $\widetilde O ( \sqrt{K} + \sqrt{D} )$ regret
assuming that the costs are stochastic, and $\widetilde O ( K^{2/3} + D^{2/3}
)$ regret in the general case. To our knowledge, we are the first to consider
the important setting of delayed feedback in adversarial MDPs.
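  To make the delay model concrete, below is a minimal Python sketch of the feedback protocol described in the abstract, not the paper's actual policy-optimization algorithm: the costs and trajectory of episode $k$ are buffered and only become observable at the end of episode $k + d^k$, and $D = \sum_k d^k$ is the total delay. The names (e.g., `update_policy`) and the randomly drawn delays are illustrative placeholders; in the paper the delays are chosen by an adversary.
```python
# Minimal sketch of the delayed-feedback episodic protocol (illustrative only).
import random

K = 100                                             # number of episodes
delays = [random.randint(0, 10) for _ in range(K)]  # d^k; adversarial in the paper, random here
D = sum(delays)                                     # total delay D = sum_k d^k

pending = {}                                        # feedback buffered until it becomes observable

for k in range(K):
    # The learner plays episode k with its current policy; its feedback is not visible yet.
    feedback_k = (k, f"costs-and-trajectory-of-episode-{k}")
    pending.setdefault(k + delays[k], []).append(feedback_k)

    # At the end of episode k, all feedback scheduled for a time <= k arrives at once.
    arrived = [f for t in sorted(pending) if t <= k for f in pending[t]]
    for t in [t for t in pending if t <= k]:
        del pending[t]
    # update_policy(arrived)  # placeholder for a delayed policy-optimization update

print(f"K = {K}, total delay D = {D}; full-information regret scales as ~ sqrt(K) + sqrt(D)")
```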
Related papers
- Delay as Payoff in MAB [40.65658965074464]
We investigate a variant of the classical Multi-armed Bandit (MAB) problem, where the payoff received by an agent is both delayed, and directly corresponds to the magnitude of the delay.
Our main contributions are tight upper and lower bounds for both the cost and reward settings.
Our regret bounds highlight the difference between the cost and reward scenarios, showing that the improvement in the cost scenario is more significant than for the reward.
arXiv Detail & Related papers (2024-08-27T15:52:52Z)
- Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback [67.63049551992816]
We study online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback.
We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^{K} d^k$ is the total delay.
arXiv Detail & Related papers (2022-01-31T12:34:26Z)
- The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition [49.78053380710322]
We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes.
Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known.
In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader (FTRL) framework together with a set of new techniques.
arXiv Detail & Related papers (2021-06-08T05:46:35Z)
- Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem, in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde O(B_\star \sqrt{|S| |A| K})$, where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z)
- Finding the Stochastic Shortest Path with Low Regret: The Adversarial Cost and Unknown Transition Case [29.99619764839178]
We make significant progress on the stochastic shortest path problem with adversarial costs and unknown transition.
Specifically, we develop algorithms that achieve $\widetilde{O}(\sqrt{S^3 A^2 D T_\star K})$ regret for the full-information setting.
We are also the first to consider the most challenging combination: bandit feedback with adversarial costs and unknown transition.
arXiv Detail & Related papers (2021-02-10T06:33:04Z)
- Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition [37.6975819766632]
We study the shortest path problem with adversarial costs and known transition.
We show that the minimax regret is $\widetilde{O}(\sqrt{D T_\star K})$ for the full-information setting and $\widetilde{O}(\sqrt{D T_\star S A K})$ for the bandit feedback setting.
arXiv Detail & Related papers (2020-12-07T20:55:28Z)
- Stochastic Shortest Path with Adversarially Changing Costs [57.90236104782219]
Stochastic shortest path (SSP) is a well-known problem in planning and control.
We present the adversarial SSP model that also accounts for adversarial changes in the costs over time.
We are the first to consider this natural setting of adversarial SSP and obtain sub-linear regret for it.
arXiv Detail & Related papers (2020-06-20T12:10:35Z)
- Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition [38.28925339231888]
We develop the first algorithm with a "best-of-both-worlds" guarantee.
It achieves $\mathcal{O}(\log T)$ regret when the losses are stochastic.
More generally, it achieves $\tilde{\mathcal{O}}(\sqrt{C})$ regret in an intermediate setting.
arXiv Detail & Related papers (2020-06-10T01:59:34Z)
- Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost.
We show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S| |A| K})$ regret in the worst case.
arXiv Detail & Related papers (2020-02-23T09:10:14Z)