Related papers: Best-of-Both-Worlds Policy Optimization for CMDPs with Bandit Feedback

URL: http://arxiv.org/abs/2410.02269v1
Date: Thu, 3 Oct 2024 07:44:40 GMT
Title: Best-of-Both-Worlds Policy Optimization for CMDPs with Bandit Feedback
Authors: Francesco Emanuele Stradi, Anna Lunghi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
Abstract summary: Stradi et al. (2024) proposed the first best-of-both-worlds algorithm for constrained Markov decision processes. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.
Score: 34.7178680288326
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study online learning in constrained Markov decision processes (CMDPs) in which rewards and constraints may be either stochastic or adversarial. In such settings, Stradi et al. (2024) proposed the first best-of-both-worlds algorithm able to seamlessly handle stochastic and adversarial constraints, achieving optimal regret and constraint violation bounds in both cases. This algorithm suffers from two major drawbacks. First, it only works under full feedback, which severely limits its applicability in practice. Moreover, it relies on optimizing over the space of occupancy measures, which requires solving convex optimization problems, a highly inefficient task. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Specifically, when the constraints are stochastic, the algorithm achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation, while, when they are adversarial, it attains $\widetilde{\mathcal{O}}(\sqrt{T})$ constraint violation and a tight fraction of the optimal reward. Moreover, our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.

Related papers

Optimal Regret for Policy Optimization in Contextual Bandits [45.0314528751357]
We present the first high-probability optimal regret bound for a policy optimization technique applied to the problem of contextual multi-armed bandit (CMAB). Our algorithm is both efficient and achieves an optimal regret bound of $\widetilde{O}(\sqrt{K|\mathcal{A}|\log|\mathcal{F}|})$, where $K$ is the number of rounds, $\mathcal{A}$ is the set of arms, and $\mathcal{F}$ is the function class used to approximate the losses.
arXiv Detail & Related papers (2026-02-14T09:51:24Z)
Efficient Computation of Blackwell Optimal Policies using Rational Functions [3.0529230554642752]
Markov Decision Problems (MDPs) provide a foundational framework for modelling sequential decision-making across diverse domains. Discounted optimality may overly prioritise short-term rewards, while average optimality relies on strong structural assumptions. Blackwell optimality addresses these challenges, offering a robust and comprehensive criterion that ensures optimality under both discounted and average reward frameworks.
arXiv Detail & Related papers (2025-08-25T17:41:30Z)
Safe and Efficient Online Convex Optimization with Linear Budget Constraints and Partial Feedback [3.5554907645160605]
This paper studies online convex optimization with unknown linear budget constraints. We propose a safe and efficient Lyapunov-optimization algorithm (SELO) that can achieve an $O(\sqrt{T})$ regret and zero cumulative constraint violation.
arXiv Detail & Related papers (2024-12-05T08:58:41Z)
Optimal Strong Regret and Violation in Constrained MDPs via Policy Optimization [37.24692425018]
We study online learning in constrained MDPs (CMDPs). Our algorithm implements a primal-dual scheme that employs a state-of-the-art policy optimization approach for adversarial MDPs.
arXiv Detail & Related papers (2024-10-03T07:54:04Z)
Beyond Primal-Dual Methods in Bandits with Stochastic and Adversarial Constraints [29.514323697659613]
We address a generalization of the bandit with knapsacks problem, where a learner aims to maximize rewards while satisfying an arbitrary set of long-term constraints. Our goal is to design best-of-both-worlds algorithms that perform under both stochastic and adversarial constraints.
arXiv Detail & Related papers (2024-05-25T08:09:36Z)
Second Order Methods for Bandit Optimization and Control [34.51425758864638]
We show that our algorithm achieves optimal (in terms of horizon) regret bounds for a large class of convex functions that we call $\kappa$-convex. We also investigate the adaptation of our second-order bandit algorithm to online convex optimization with memory.
arXiv Detail & Related papers (2024-02-14T04:03:38Z)
Zeroth and First Order Stochastic Frank-Wolfe Algorithms for Constrained Optimization [13.170519806372075]
Problems of convex optimization with two sets of constraints arise frequently in the context of semidefinite programming. Since projection onto the first set of constraints is difficult, it becomes necessary to explore projection-free algorithms. The efficacy of the proposed algorithms is tested on relevant applications of sparse matrix estimation, clustering via semidefinite relaxation, and uniform sparsest cut problem.
arXiv Detail & Related papers (2021-07-14T08:01:30Z)
Lower Bounds and Optimal Algorithms for Smooth and Strongly Convex Decentralized Optimization Over Time-Varying Networks [79.16773494166644]
We consider the task of minimizing the sum of smooth and strongly convex functions stored in a decentralized manner across the nodes of a communication network. We design two optimal algorithms that attain these lower bounds. We corroborate the theoretical efficiency of these algorithms by performing an experimental comparison with existing state-of-the-art methods.
arXiv Detail & Related papers (2021-06-08T15:54:44Z)
Minimax Optimization with Smooth Algorithmic Adversaries [59.47122537182611]
We propose a new algorithm for the min-player against smooth algorithms deployed by an adversary. Our algorithm is guaranteed to make monotonic progress having no limit cycles, and to find an appropriate number of gradient ascents.
arXiv Detail & Related papers (2021-06-02T22:03:36Z)
An Asymptotically Optimal Primal-Dual Incremental Algorithm for Contextual Linear Bandits [129.1029690825929]
We introduce a novel algorithm improving over the state-of-the-art along multiple dimensions. We establish minimax optimality for any learning horizon in the special case of non-contextual linear bandits.
arXiv Detail & Related papers (2020-10-23T09:12:47Z)
Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model. The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
Conservative Stochastic Optimization with Expectation Constraints [11.393603788068777]
This paper considers convex optimization problems where the objective and constraint functions involve expectations with respect to the data indices or environmental variables. Online and efficient approaches for solving such problems have not been widely studied. We propose a novel conservative optimization algorithm (CSOA) that achieves zero constraint violation and $O\left(T^{-\frac{1}{2}}\right)$ optimality gap.
arXiv Detail & Related papers (2020-08-13T08:56:24Z)
A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic [142.1492359556374]
Bilevel optimization is a class of problems which exhibit a two-level structure. We propose a two-timescale approximation (TTSA) algorithm for tackling such a bilevel problem. We show that a two-timescale natural actor-critic policy optimization algorithm can be viewed as a special case of our TTSA framework.
arXiv Detail & Related papers (2020-07-10T05:20:02Z)
Private Stochastic Convex Optimization: Optimal Rates in Linear Time [74.47681868973598]
We study the problem of minimizing the population loss given i.i.d. samples from a distribution over convex loss functions. A recent work of Bassily et al. has established the optimal bound on the excess population loss achievable given $n$ samples. We describe two new techniques for deriving convex optimization algorithms both achieving the optimal bound on excess loss and using $O(\min\{n, n^2/d\})$ gradient computations.
arXiv Detail & Related papers (2020-05-10T19:52:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.