Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes
with Bandit Feedback
- URL: http://arxiv.org/abs/2205.13451v1
- Date: Thu, 26 May 2022 15:55:50 GMT
- Title: Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes
with Bandit Feedback
- Authors: Yan Dai, Haipeng Luo and Liyu Chen
- Abstract summary: We consider regret minimization for Adversarial Markov Decision Processes (AMDPs), where the loss functions are changing over time and adversarially chosen.
While there has been a surge of studies on this problem using Online-Mirror-Descent (OMD) methods, very little is known about the Follow-the-Perturbed-Leader (FTPL) methods.
We develop the first no-regret algorithm for learning communicating AMDPs in the infinite-horizon setting with bandit feedback and stochastic transitions.
- Score: 35.687473978249535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider regret minimization for Adversarial Markov Decision Processes
(AMDPs), where the loss functions are changing over time and adversarially
chosen, and the learner only observes the losses for the visited state-action
pairs (i.e., bandit feedback). While there has been a surge of studies on this
problem using Online-Mirror-Descent (OMD) methods, very little is known about
the Follow-the-Perturbed-Leader (FTPL) methods, which are usually
computationally more efficient and also easier to implement since they only
require solving an offline planning problem. Motivated by this, we take a
closer look at FTPL for learning AMDPs, starting from the standard episodic
finite-horizon setting. We find some unique and intriguing difficulties in the
analysis and propose a workaround to eventually show that FTPL is also able to
achieve near-optimal regret bounds in this case. More importantly, we then find
two significant applications: First, the analysis of FTPL turns out to be
readily generalizable to delayed bandit feedback with order-optimal regret,
while OMD methods exhibit extra difficulties (Jin et al., 2022). Second, using
FTPL, we also develop the first no-regret algorithm for learning communicating
AMDPs in the infinite-horizon setting with bandit feedback and stochastic
transitions. Our algorithm is efficient assuming access to an offline planning
oracle, while even for the easier full-information setting, the only existing
algorithm (Chandrasekaran and Tewari, 2021) is computationally inefficient.
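
As a rough illustration of the FTPL template described above, the sketch below instantiates it in the simpler adversarial multi-armed bandit special case, where the "offline planning oracle" degenerates to an argmin over arms and the sampling probability is estimated by geometric resampling. The function names, the exponential perturbations, and all parameters are illustrative assumptions, not the paper's actual AMDP algorithm.

```python
import numpy as np

def ftpl_bandit(loss_matrix, eta=0.1, max_resample=1000, seed=0):
    """Illustrative Follow-the-Perturbed-Leader (FTPL) sketch for adversarial
    multi-armed bandits with bandit feedback (not the paper's AMDP algorithm).

    loss_matrix : (T, K) array of adversarial losses in [0, 1]
    eta         : perturbation scale (learning-rate analogue)
    max_resample: cap on the geometric-resampling loop
    """
    rng = np.random.default_rng(seed)
    T, K = loss_matrix.shape
    loss_hat = np.zeros(K)        # cumulative importance-weighted loss estimates
    total_loss = 0.0

    def perturbed_leader():
        # The "offline planning oracle" of the bandit special case:
        # an argmin over arms of perturbed cumulative loss estimates.
        z = rng.exponential(size=K)            # fresh perturbation per call
        return int(np.argmin(eta * loss_hat - z))

    for t in range(T):
        arm = perturbed_leader()               # play the perturbed leader
        loss = float(loss_matrix[t, arm])      # bandit feedback: only this loss
        total_loss += loss

        # Geometric resampling: replay the leader with fresh perturbations to
        # estimate 1 / P(arm is chosen), since FTPL's sampling distribution
        # has no closed form.
        inv_prob = max_resample
        for m in range(1, max_resample + 1):
            if perturbed_leader() == arm:
                inv_prob = m
                break

        loss_hat[arm] += loss * inv_prob       # importance-weighted update

    return total_loss

# Tiny usage example with random losses in [0, 1].
losses = np.random.default_rng(1).uniform(size=(500, 5))
print(ftpl_bandit(losses, eta=0.05))
```

In the AMDP settings studied in the paper, the argmin over arms would be replaced by a call to an offline planning oracle that returns an optimal policy for the perturbed cumulative loss estimates over state-action pairs.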
Related papers
- Optimism in the Face of Ambiguity Principle for Multi-Armed Bandits [6.7310264583128445]
Follow-The-Regularized-Leader (FTRL) algorithms often enjoy optimal regret for adversarial as well as stochastic bandit problems.
We propose a new FTPL algorithm that achieves optimal regret for both adversarial and stochastic multi-armed bandits.
arXiv Detail & Related papers (2024-09-30T16:00:23Z) - Distributed Online Bandit Nonconvex Optimization with One-Point Residual Feedback via Dynamic Regret [10.700891331004799]
This paper considers the distributed online bandit optimization problem with nonconvex loss functions over a time-varying digraph.
At each round, each player selects an action, and then the adversary assigns an arbitrary nonconvex loss function to that player.
The expected regret of our algorithms is comparable to existing algorithms that use two-point deviation.
arXiv Detail & Related papers (2024-09-24T02:37:33Z) - Tractable Offline Learning of Regular Decision Processes [50.11277112628193]
This work studies offline Reinforcement Learning (RL) in a class of non-Markovian environments called Regular Decision Processes (RDPs).
In RDPs, the unknown dependency of future observations and rewards on past interactions can be captured by a hidden finite-state automaton.
Many algorithms first reconstruct this unknown dependency using automata learning techniques.
arXiv Detail & Related papers (2024-09-04T14:26:58Z) - Truly No-Regret Learning in Constrained MDPs [61.78619476991494]
We propose a model-based primal-dual algorithm to learn in an unknown CMDP.
We prove that our algorithm achieves sublinear regret without error cancellations.
arXiv Detail & Related papers (2024-02-24T09:47:46Z) - No-Regret Reinforcement Learning in Smooth MDPs [24.249446550171307]
We introduce a novel structural assumption on Markov decision processes (MDPs) that generalizes most of the settings proposed so far.
We propose two algorithms for regret minimization under $\nu$-smoothness.
We compare our results with state-of-the-art ones from RL theory, showing that our algorithms achieve the best guarantees.
arXiv Detail & Related papers (2024-02-06T08:18:14Z) - Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z) - Regret minimization in stochastic non-convex learning via a
proximal-gradient approach [80.59047515124198]
Motivated by applications in machine learning and operations, we study regret minimization with first-order oracle feedback in online constrained problems.
We develop a new proximal-gradient method with regret guarantees, building on proximal complexity reduction techniques.
arXiv Detail & Related papers (2020-10-13T09:22:21Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Refined Analysis of FPL for Adversarial Markov Decision Processes [9.188318506016897]
Follow-the-Perturbed-Leader (FPL) based algorithms have been proposed in previous literature.
We improve the analysis of FPL-based algorithms in both settings, matching the current best regret bounds using faster and simpler algorithms.
arXiv Detail & Related papers (2020-08-21T01:12:10Z) - More Practical and Adaptive Algorithms for Online Quantum State Learning [11.836183463815653]
In this paper, we develop algorithms to advance the online learning of quantum states.
First, we show that the Regularized Follow-the-Leader (RFTL) method with Tsallis-2 entropy can achieve an $O(\sqrt{MT})$ total loss with perfect hindsight.
Second, we propose a parameter-free algorithm based on a classical adjusting learning rate schedule.
arXiv Detail & Related papers (2020-06-01T15:17:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.