Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes
with Bandit Feedback
- URL: http://arxiv.org/abs/2205.13451v1
- Date: Thu, 26 May 2022 15:55:50 GMT
- Title: Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes
with Bandit Feedback
- Authors: Yan Dai, Haipeng Luo and Liyu Chen
- Abstract summary: We consider regret minimization for Adversarial Markov Decision Processes (AMDPs), where the loss functions are changing over time and adversarially chosen.
While there has been a surge of studies on this problem using Online-Mirror-Descent (OMD) methods, very little is known about the Follow-the-Perturbed-Leader (FTPL) methods.
We develop the first no-regret algorithm for learning communicating AMDPs in the infinite-horizon setting with bandit feedback and stochastic transitions.
- Score: 35.687473978249535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider regret minimization for Adversarial Markov Decision Processes
(AMDPs), where the loss functions are changing over time and adversarially
chosen, and the learner only observes the losses for the visited state-action
pairs (i.e., bandit feedback). While there has been a surge of studies on this
problem using Online-Mirror-Descent (OMD) methods, very little is known about
the Follow-the-Perturbed-Leader (FTPL) methods, which are usually
computationally more efficient and also easier to implement since they only
require solving an offline planning problem. Motivated by this, we take a
closer look at FTPL for learning AMDPs, starting from the standard episodic
finite-horizon setting. We find some unique and intriguing difficulties in the
analysis and propose a workaround to eventually show that FTPL is also able to
achieve near-optimal regret bounds in this case. More importantly, we then find
two significant applications: First, the analysis of FTPL turns out to be
readily generalizable to delayed bandit feedback with order-optimal regret,
while OMD methods exhibit extra difficulties (Jin et al., 2022). Second, using
FTPL, we also develop the first no-regret algorithm for learning communicating
AMDPs in the infinite-horizon setting with bandit feedback and stochastic
transitions. Our algorithm is efficient assuming access to an offline planning
oracle, while even for the easier full-information setting, the only existing
algorithm (Chandrasekaran and Tewari, 2021) is computationally inefficient.
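
As a rough illustration of the FTPL template described above, the sketch below instantiates it in the simpler adversarial multi-armed bandit special case, where the "offline planning oracle" degenerates to an argmin over arms and the sampling probability is estimated by geometric resampling. The function names, the exponential perturbations, and all parameters are illustrative assumptions, not the paper's actual AMDP algorithm.

```python
import numpy as np

def ftpl_bandit(loss_matrix, eta=0.1, max_resample=1000, seed=0):
    """Illustrative Follow-the-Perturbed-Leader (FTPL) sketch for adversarial
    multi-armed bandits with bandit feedback (not the paper's AMDP algorithm).

    loss_matrix : (T, K) array of adversarial losses in [0, 1]
    eta         : perturbation scale (learning-rate analogue)
    max_resample: cap on the geometric-resampling loop
    """
    rng = np.random.default_rng(seed)
    T, K = loss_matrix.shape
    loss_hat = np.zeros(K)        # cumulative importance-weighted loss estimates
    total_loss = 0.0

    def perturbed_leader():
        # The "offline planning oracle" of the bandit special case:
        # an argmin over arms of perturbed cumulative loss estimates.
        z = rng.exponential(size=K)            # fresh perturbation per call
        return int(np.argmin(eta * loss_hat - z))

    for t in range(T):
        arm = perturbed_leader()               # play the perturbed leader
        loss = float(loss_matrix[t, arm])      # bandit feedback: only this loss
        total_loss += loss

        # Geometric resampling: replay the leader with fresh perturbations to
        # estimate 1 / P(arm is chosen), since FTPL's sampling distribution
        # has no closed form.
        inv_prob = max_resample
        for m in range(1, max_resample + 1):
            if perturbed_leader() == arm:
                inv_prob = m
                break

        loss_hat[arm] += loss * inv_prob       # importance-weighted update

    return total_loss

# Tiny usage example with random losses in [0, 1].
losses = np.random.default_rng(1).uniform(size=(500, 5))
print(ftpl_bandit(losses, eta=0.05))
```

In the AMDP settings studied in the paper, the argmin over arms would be replaced by a call to an offline planning oracle that returns an optimal policy for the perturbed cumulative loss estimates over state-action pairs.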
Related papers
- Optimism in the Face of Ambiguity Principle for Multi-Armed Bandits [6.7310264583128445]
Follow-The-Regularized-Leader (FTRL) algorithms often enjoy optimal regret for adversarial as well as stochastic bandit problems.
We propose a new FTPL algorithm that achieves optimal regret for both adversarial and stochastic multi-armed bandits.
arXiv Detail & Related papers (2024-09-30T16:00:23Z) - Distributed Online Bandit Nonconvex Optimization with One-Point Residual Feedback via Dynamic Regret [10.700891331004799]
This paper considers the distributed online bandit optimization problem with nonconvex loss functions over a time-varying digraph.
At each round, each player selects an action, and then the adversary assigns an arbitrary nonconvex loss function to that player.
The expected regret of our algorithms is comparable to existing algorithms that use two-point deviation.
arXiv Detail & Related papers (2024-09-24T02:37:33Z) - Tractable Offline Learning of Regular Decision Processes [50.11277112628193]
This work studies offline Reinforcement Learning (RL) in a class of non-Markovian environments called Regular Decision Processes (RDPs).
In RDPs, the unknown dependency of future observations and rewards on past interactions can be captured by a hidden finite-state automaton.
Many algorithms first reconstruct this unknown dependency using automata learning techniques.
arXiv Detail & Related papers (2024-09-04T14:26:58Z) - Truly No-Regret Learning in Constrained MDPs [61.78619476991494]
We propose a model-based primal-dual algorithm to learn in an unknown CMDP.
We prove that our algorithm achieves sublinear regret without error cancellations.
arXiv Detail & Related papers (2024-02-24T09:47:46Z) - No-Regret Reinforcement Learning in Smooth MDPs [24.249446550171307]
We introduce a novel structural assumption on Markov decision processes (MDPs) that generalizes most of the settings proposed so far.
We propose two algorithms for regret minimization under $\nu$-smoothness.
We compare our results with state-of-the-art ones from RL theory, showing that our algorithms achieve the best guarantees.
arXiv Detail & Related papers (2024-02-06T08:18:14Z) - Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z) - Regret minimization in stochastic non-convex learning via a
proximal-gradient approach [80.59047515124198]
Motivated by applications in machine learning and operations, we study regret minimization with first-order oracle feedback in online constrained problems.
We develop a new proximal-gradient method with regret guarantees, building on proximal complexity reduction techniques.
arXiv Detail & Related papers (2020-10-13T09:22:21Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Refined Analysis of FPL for Adversarial Markov Decision Processes [9.188318506016897]
Follow-the-Perturbed-Leader (FPL) based algorithms have been proposed in previous literature.
We improve the analysis of FPL-based algorithms in both settings, matching the current best regret bounds using faster and simpler algorithms.
arXiv Detail & Related papers (2020-08-21T01:12:10Z) - More Practical and Adaptive Algorithms for Online Quantum State Learning [11.836183463815653]
In this paper, we develop algorithms to advance the online learning of quantum states.
First, we show that the Regularized Follow-the-Leader (RFTL) method with Tsallis-2 entropy can achieve an $O(\sqrt{MT})$ total loss with perfect hindsight.
Second, we propose a parameter-free algorithm based on a classical adjusting learning rate schedule.
arXiv Detail & Related papers (2020-06-01T15:17:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.