Online Learning in MDPs with Partially Adversarial Transitions and Losses
- URL: http://arxiv.org/abs/2602.09474v1
- Date: Tue, 10 Feb 2026 07:13:11 GMT
- Title: Online Learning in MDPs with Partially Adversarial Transitions and Losses
- Authors: Ofir Schlisselberg, Tal Lancewicki, Yishay Mansour
- Abstract summary: We study reinforcement learning in MDPs whose transition function is stochastic at most steps but may behave adversarially at a fixed subset of $\Lambda$ steps per episode. We introduce \emph{conditioned occupancy measures}, which remain stable across episodes even with adversarial transitions. We characterize the regret of adversarial MDPs in the \emph{fully adversarial} setting ($\Lambda=H-1$) for both full-information and bandit feedback.
- Score: 42.59565593884281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study reinforcement learning in MDPs whose transition function is stochastic at most steps but may behave adversarially at a fixed subset of $\Lambda$ steps per episode. This model captures environments that are stable except at a few vulnerable points. We introduce \emph{conditioned occupancy measures}, which remain stable across episodes even with adversarial transitions, and use them to design two algorithms. The first handles arbitrary adversarial steps and achieves regret $\tilde{O}(H S^{\Lambda}\sqrt{K S A^{\Lambda+1}})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions, and $H$ is the episode's horizon. The second, assuming the adversarial steps are consecutive, improves the dependence on $S$ to $\tilde{O}(H\sqrt{K S^{3} A^{\Lambda+1}})$. We further give a $K^{2/3}$-regret reduction that removes the need to know which steps are the $\Lambda$ adversarial ones. We also characterize the regret of adversarial MDPs in the \emph{fully adversarial} setting ($\Lambda=H-1$) for both full-information and bandit feedback, and provide almost matching upper and lower bounds (slightly strengthening existing lower bounds and clarifying how different feedback structures affect the hardness of learning).
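The conditioned occupancy measures above build on the standard (unconditioned) occupancy measure of a finite-horizon tabular MDP. As background, here is a minimal Python/NumPy sketch of the plain forward recursion for $q_h(s,a) = \Pr[s_h = s, a_h = a]$; the function name, array shapes, and deterministic initial state are illustrative assumptions, and the paper's conditioned variant is not reproduced here.

```python
import numpy as np

def occupancy_measure(P, pi, H, S, A, s0=0):
    """Forward recursion for the standard occupancy measure
    q_h(s, a) = Pr[s_h = s, a_h = a] under policy pi.

    P  : transitions, shape (H, S, A, S); P[h, s, a, t] = Pr[s_{h+1}=t | s, a]
    pi : policy, shape (H, S, A); pi[h, s, a] = Pr[a_h = a | s_h = s]
    s0 : deterministic initial state (an illustrative assumption)
    """
    q = np.zeros((H, S, A))
    mu = np.zeros(S)
    mu[s0] = 1.0
    for h in range(H):
        q[h] = mu[:, None] * pi[h]               # split state mass over actions
        mu = np.einsum("sa,sat->t", q[h], P[h])  # push mass through P_h
    return q
```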
Related papers
- Reinforcement Learning from Adversarial Preferences in Tabular MDPs [62.73758165845971]
We introduce a new framework of episodic Markov decision processes (MDPs) with adversarial preferences, termed preference-based MDPs (PbMDPs). Unlike standard episodic MDPs with adversarial losses, in PbMDPs the learner instead observes preferences between two candidate arms. We develop algorithms that achieve a regret bound of order $T^{2/3}$ under known transitions.
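As a rough illustration of this feedback model, here is a hedged Python sketch of one round of preference feedback: the learner never sees the losses themselves, only a binary comparison. The logistic (Bradley-Terry-style) preference model is an assumption for illustration; the cited paper's exact model may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def preference_feedback(loss_a, loss_b, temperature=1.0):
    """One round of preference feedback (illustrative assumption:
    a logistic / Bradley-Terry model on the loss gap). The learner
    observes only the returned bit, never loss_a or loss_b."""
    p_a_preferred = 1.0 / (1.0 + np.exp((loss_a - loss_b) / temperature))
    return rng.random() < p_a_preferred  # True iff arm a is preferred
```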
arXiv Detail & Related papers (2025-07-15T20:19:32Z)
- Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback [49.84060509296641]
We study online finite-horizon Markov Decision Processes with adversarially changing losses and aggregate bandit feedback (a.k.a. full-bandit feedback). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first Policy Optimization algorithms for this setting.
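To make the feedback model concrete, here is a minimal Python sketch (with a hypothetical `env` interface) of a rollout under aggregate bandit feedback: per-step losses exist inside the environment, but only their sum is ever revealed to the learner.

```python
def rollout_aggregate_feedback(env, policy, H):
    """Run one episode under aggregate (full-bandit) feedback.
    `env` is a hypothetical interface: reset() -> state,
    step(action) -> (next_state, per_step_loss)."""
    s = env.reset()
    total_loss = 0.0
    for h in range(H):
        a = policy(h, s)       # policy may depend on the step h
        s, loss = env.step(a)  # per-step losses exist internally...
        total_loss += loss     # ...but are only accumulated
    return total_loss          # the learner sees just this scalar
```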
arXiv Detail & Related papers (2025-02-06T12:03:24Z)
- Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z)
- No-Regret Online Reinforcement Learning with Adversarial Losses and Transitions [39.864174754882754]
Existing online learning algorithms for adversarial Markov Decision Processes achieve $O(\sqrt{T})$ regret after $T$ rounds of interactions, with the caveat that the transition function has to be fixed. This is because it has been shown that adversarial transition functions make no-regret learning impossible. We propose an algorithm that enjoys $\widetilde{O}(\sqrt{T} + C^{\textsf{P}})$ regret, where $C^{\textsf{P}}$ measures how adversarial the transition functions are.
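The summary does not pin down how $C^{\textsf{P}}$ is defined; one natural instantiation, sketched below in Python under stated assumptions (total-variation distance to a fixed base model, worst case over step/state/action, summed over episodes), conveys the idea. The cited paper's exact definition may aggregate differently.

```python
import numpy as np

def transition_corruption(P_base, P_episodes):
    """A hedged sketch of a corruption measure for adversarial
    transitions: worst-case total variation between each episode's
    transition function and a fixed base model, summed over episodes.

    P_base     : shape (H, S, A, S)
    P_episodes : shape (K, H, S, A, S)
    """
    # 1/2 * L1 distance over next states gives total variation per (h, s, a)
    tv = 0.5 * np.abs(P_episodes - P_base[None]).sum(axis=-1)  # (K, H, S, A)
    # worst case within each episode, then summed over the K episodes
    return tv.max(axis=(1, 2, 3)).sum()
```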
arXiv Detail & Related papers (2023-05-27T06:10:17Z)
- Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm that achieves a near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z)
- Decentralized Cooperative Reinforcement Learning with Hierarchical Information Structure [14.919120396838208]
We consider two-agent multi-armed bandits (MABs) and Markov decision processes (MDPs) with a hierarchical information structure arising in applications.
In each step, the "leader" chooses her action first, and then the "follower" decides his action after observing the leader's action.
For the MDP setting, we obtain $\widetilde{\mathcal{O}}(\sqrt{H^7S^2ABT})$ regret, where $H$ is the number of steps per episode and $S$ is the number of states.
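A minimal Python sketch of one step under this interaction order follows; the `env` and policy interfaces are hypothetical, intended only to show that the follower acts on strictly more information than the leader.

```python
def hierarchical_step(env, leader_policy, follower_policy, h, s):
    """One step under the hierarchical information structure:
    the leader commits first, the follower observes the leader's
    action before choosing. Interfaces are illustrative assumptions."""
    a_leader = leader_policy(h, s)                 # leader moves first
    a_follower = follower_policy(h, s, a_leader)   # follower sees a_leader
    s_next, loss = env.step(a_leader, a_follower)  # joint transition and loss
    return s_next, loss
```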
arXiv Detail & Related papers (2021-11-01T09:18:07Z)
- Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$.
We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z)