Related papers: Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback

Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback

URL: http://arxiv.org/abs/2502.04004v1
Date: Thu, 06 Feb 2025 12:03:24 GMT
Title: Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback
Authors: Tal Lancewicki, Yishay Mansour,
Abstract summary: We study online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback (a.k.a full-bandit)<n>Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory.<n>We introduce the first Policy Optimization algorithms for this setting.
Score: 49.84060509296641
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback (a.k.a full-bandit). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first Policy Optimization algorithms for this setting. In the known-dynamics case, we achieve the first \textit{optimal} regret bound of $\tilde \Theta(H^2\sqrt{SAK})$, where $K$ is the number of episodes, $H$ is the episode horizon, $S$ is the number of states, and $A$ is the number of actions. In the unknown dynamics case we establish regret bound of $\tilde O(H^3 S \sqrt{AK})$, significantly improving the best known result by a factor of $H^2 S^5 A^2$.

Related papers

Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback [58.66941279460248]
Learning from human feedback plays an important role in aligning generative models, such as large language models (LLM) We study a model within this domain--contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary. We propose an algorithm namely robust contextual dueling bandits (RCDB), which is based on uncertainty-weighted maximum likelihood estimation.
arXiv Detail & Related papers (2024-04-16T17:59:55Z)
Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses. We propose a new algorithm that attains an $widetildeO(dsqrtHS3K + sqrtHSAK)$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z)
Context-lumpable stochastic bandits [49.024050919419366]
We consider a contextual bandit problem with $S$ contexts and $K$ actions. We give an algorithm that outputs an $epsilon$-optimal policy after using at most $widetilde O(r (S +K )/epsilon2)$ samples. In the regret setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $widetilde O(sqrtr3(S+K)T)$.
arXiv Detail & Related papers (2023-06-22T17:20:30Z)
Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $tau$. We show the minimax CVaR regret rate is $Omega(sqrttau-1AK)$, where $A$ is the number of actions and $K$ is the number of episodes. We show that our algorithm achieves the optimal regret of $widetilde O(tau-1sqrtSAK)$ under a continuity assumption and in general attains a near
arXiv Detail & Related papers (2023-02-07T02:22:31Z)
Double Thompson Sampling in Finite stochastic Games [10.559241770674724]
We consider the trade-off problem between exploration and exploitation under finite discounted Markov Decision Process. We propose a double Thompson sampling reinforcement learning algorithm(DTS) to solve this kind of problem.
arXiv Detail & Related papers (2022-02-21T06:11:51Z)
Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
We present an efficient algorithm for task-agnostic reinforcement learning. The algorithm takes only $widetildemathcalO (1/epsilon cdot (H3SA / rho + H4 S2 A) )$ episodes of exploration. We show that, information-theoretically, this bound is nearly tight for $rho Theta (1/(HS))$ and $H>1$.
arXiv Detail & Related papers (2021-08-11T20:42:46Z)
Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$. We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z)
Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost. We show that the minimax regret for this setting is $widetilde O(B_star sqrt|S| |A| K)$ where $B_star$ is a bound on the expected cost of the optimal policy from any state. Our algorithm runs in-time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z)
Finding the Stochastic Shortest Path with Low Regret: The Adversarial Cost and Unknown Transition Case [29.99619764839178]
We make significant progress toward the shortest path problem with adversarial costs and unknown transition. Specifically, we develop algorithms that achieve $widetildeO(sqrtS3A2DT_star K)$ regret for the full-information setting. We are also the first to consider the most challenging combination: bandit feedback with adversarial costs and unknown transition.
arXiv Detail & Related papers (2021-02-10T06:33:04Z)
Lazy OCO: Online Convex Optimization on a Switching Budget [34.936641201844054]
We study a variant of online convex optimization where the player is permitted to switch decisions at most $S$ times in expectation throughout $T$ rounds. Similar problems have been addressed in prior work for the discrete decision set setting, and more recently in the continuous setting but only with an adaptive adversary.
arXiv Detail & Related papers (2021-02-07T14:47:19Z)
Learning Adversarial Markov Decision Processes with Delayed Feedback [45.86354980347581]
We consider online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback. We present novel algorithms that achieve near-optimal high-probability regret of $widetilde O ( sqrtK + sqrtD )$ under full-information feedback.
arXiv Detail & Related papers (2020-12-29T16:47:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.