Online Markov Decision Processes with Terminal Law Constraints
- URL: http://arxiv.org/abs/2601.07492v1
- Date: Mon, 12 Jan 2026 12:46:12 GMT
- Title: Online Markov Decision Processes with Terminal Law Constraints
- Authors: Bianca Marin Moreno, Margaux Brégère, Pierre Gaillard, Nadia Oudjane,
- Abstract summary: We introduce a reset-free framework called the periodic framework. The goal is to find periodic policies that minimize cumulative loss and return the agents to their initial state distribution after a fixed number of steps. We propose the first algorithms for computing periodic policies in two multi-agent settings and show that they achieve sublinear periodic regret of order $\tilde O(T^{3/4})$. This provides the first non-asymptotic guarantees for reset-free learning in the setting of $M$ homogeneous agents, for $M > 1$.
- Score: 10.878763806286157
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional reinforcement learning usually assumes either episodic interactions with resets or continuous operation to minimize average or cumulative loss. While episodic settings have many theoretical results, resets are often unrealistic in practice. The infinite-horizon setting avoids this issue but lacks non-asymptotic guarantees in online scenarios with unknown dynamics. In this work, we move towards closing this gap by introducing a reset-free framework called the periodic framework, where the goal is to find periodic policies: policies that not only minimize cumulative loss but also return the agents to their initial state distribution after a fixed number of steps. We formalize the problem of finding optimal periodic policies and identify sufficient conditions under which it is well-defined for tabular Markov decision processes. To evaluate algorithms in this framework, we introduce the periodic regret, a measure that balances cumulative loss with the terminal law constraint. We then propose the first algorithms for computing periodic policies in two multi-agent settings and show they achieve sublinear periodic regret of order $\tilde O(T^{3/4})$. This provides the first non-asymptotic guarantees for reset-free learning in the setting of $M$ homogeneous agents, for $M > 1$.
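As a rough illustration of the periodic regret (the abstract only says it "balances cumulative loss with the terminal law constraint", so the decomposition and symbols below are assumptions, not the paper's exact definition), one can picture a measure of the form
$$ \mathcal{R}_T \;=\; \sum_{t=1}^{T} \Big( \mathbb{E}^{\pi_t}\Big[\sum_{n=1}^{N} \ell_t(s_n, a_n)\Big] - \mathbb{E}^{\pi^\star}\Big[\sum_{n=1}^{N} \ell_t(s_n, a_n)\Big] \Big) \;+\; \sum_{t=1}^{T} D\big(\mu_N^{\pi_t}, \mu_0\big), $$
where $N$ is the period length, $\mu_N^{\pi_t}$ the state distribution after $N$ steps under the period-$t$ policy, $\mu_0$ the initial distribution, and $D$ some distance between distributions penalizing violation of the terminal-law constraint.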
Related papers
- Decoupled Continuous-Time Reinforcement Learning via Hamiltonian Flow [1.8824572526199168]
Real-world control problems evolve in continuous time with non-uniform, event-driven decisions. As time gaps shrink, the $Q$-function collapses to the value function $V$, eliminating action ranking. Existing continuous-time methods reintroduce action information via an advantage-rate function $q$. We propose a novel decoupled continuous-time actor-critic algorithm with alternating updates.
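A minimal sketch of the collapse mentioned above, under standard continuous-time assumptions (this expansion is not stated in the abstract itself): for a small decision interval $h$,
$$ Q_h(s, a) \;=\; V(s) + h\, q(s, a) + o(h), $$
so as $h \to 0$ the $Q$-values of all actions converge to $V(s)$ and only the advantage-rate term $q(s, a)$ retains information for ranking actions.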
arXiv Detail & Related papers (2026-02-16T09:35:25Z) - Revisiting Weighted Strategy for Non-stationary Parametric Bandits and MDPs [56.246783503873225]
This paper revisits the weighted strategy for non-stationary parametric bandits. We propose a simpler weight-based algorithm that is as efficient as window/restart-based algorithms. Our framework can be used to improve regret bounds of other parametric bandits.
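For intuition, a commonly used weighted least-squares estimator in non-stationary linear bandits (a generic sketch, not necessarily the exact estimator proposed in this paper) discounts old observations:
$$ \hat{\theta}_t \;=\; \arg\min_{\theta} \sum_{s=1}^{t-1} \gamma^{\,t-1-s}\,\big(r_s - \langle x_s, \theta\rangle\big)^2 \;+\; \lambda \|\theta\|_2^2, \qquad \gamma \in (0, 1), $$
so more recent rewards carry larger weight when the reward parameter drifts over time.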
arXiv Detail & Related papers (2026-01-03T04:50:21Z) - Quantile-Optimal Policy Learning under Unmeasured Confounding [55.72891849926314]
We study quantile-optimal policy learning where the goal is to find a policy whose reward distribution has the largest $\alpha$-quantile for some $\alpha \in (0, 1)$. Such a problem suffers from three main challenges: (i) nonlinearity of the quantile objective as a functional of the reward distribution, (ii) unobserved confounding, and (iii) insufficient coverage of the offline dataset.
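In symbols (a sketch consistent with the abstract; the notation is assumed), writing $F_\pi$ for the CDF of the cumulative reward under policy $\pi$, the quantile-optimal objective is
$$ \max_{\pi} \; Q_\alpha(\pi), \qquad Q_\alpha(\pi) := \inf\{ z \in \mathbb{R} : F_\pi(z) \ge \alpha \}, $$
which is a nonlinear functional of the reward distribution, in contrast to the expectation used in standard policy learning.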
arXiv Detail & Related papers (2025-06-08T13:37:38Z) - Optimal Rates in Continual Linear Regression via Increasing Regularization [39.30412893918111]
We study realizable continual linear regression under random task orderings. In this setup, the worst-case expected loss after $k$ learning iterations admits a lower bound of $\Omega(1/k)$. We consider two frequently used regularization schemes: explicit isotropic $\ell_2$ regularization, and implicit regularization via finite step budgets.
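A minimal sketch of one common instantiation of explicit isotropic $\ell_2$ regularization in this setting (the exact scheme and its schedule are the paper's contribution and are not specified in this abstract): at task $k$ with data $(X_k, y_k)$, the learner solves
$$ w_k \;=\; \arg\min_{w} \; \|X_k w - y_k\|_2^2 \;+\; \lambda_k \,\|w - w_{k-1}\|_2^2, $$
where pulling the solution toward the previous iterate $w_{k-1}$ controls forgetting; the title suggests the strength $\lambda_k$ is increased over iterations.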
arXiv Detail & Related papers (2025-06-06T19:51:14Z) - Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with the unknown transition and adversarial rewards under full-information feedback.
We propose a novel algorithm that combines the benefits of two popular methods: occupancy-measure-based and policy-based.
Our algorithm enjoys an $\widetilde{\mathcal{O}}\big(d \sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)}\big)$ dynamic regret, where $d$ is the feature dimension.
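For reference, dynamic regret in this adversarial setting compares against a sequence of per-episode comparators rather than a single fixed policy; a standard formulation (the symbols here are assumptions, not taken verbatim from the abstract) is
$$ \mathrm{D\text{-}Regret}(K) \;=\; \sum_{k=1}^{K} \Big( V_k^{\pi_k^{\star}}(s_1) - V_k^{\pi_k}(s_1) \Big), $$
with $\bar{P}_K$ typically measuring how much the comparator sequence $\{\pi_k^{\star}\}$ varies across the $K$ episodes.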
arXiv Detail & Related papers (2024-11-05T13:55:52Z) - Handling Cost and Constraints with Off-Policy Deep Reinforcement Learning [2.793095554369282]
Most popular methods for off-policy learning include policy improvement steps where a learned state-action ($Q$) value function is maximized over selected batches of data.
We revisit this strategy in environments with "mixed-sign" reward functions.
We find that this second approach, when applied to continuous action spaces with mixed-sign rewards, consistently and significantly outperforms state-of-the-art methods augmented by resetting.
arXiv Detail & Related papers (2023-11-30T16:31:04Z) - Best of Both Worlds Policy Optimization [33.13041034490332]
We show that by properly designing the regularizer, the exploration bonus, and the learning rates, one can achieve a more favorable $\mathrm{polylog}(T)$ regret when the losses are stochastic, without sacrificing the worst-case guarantee in the adversarial regime.
This is the first time a gap-dependent $\mathrm{polylog}(T)$ regret bound is shown for policy optimization.
arXiv Detail & Related papers (2023-02-18T19:46:11Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
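The softmax PG iteration in question, in its standard tabular form (included here only for context), parameterizes $\pi_\theta(a \mid s) \propto \exp(\theta_{s,a})$ and performs gradient ascent on the value
$$ \theta^{(t+1)} \;=\; \theta^{(t)} + \eta \, \nabla_\theta V^{\pi_{\theta^{(t)}}}(\mu), $$
and the lower bound concerns the number of such iterations needed to converge, which can scale exponentially in $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$.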
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Acting in Delayed Environments with Non-Stationary Markov Policies [57.52103323209643]
We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of $m$ steps.
We prove that with execution delay, deterministic Markov policies in the original state-space are sufficient for attaining maximal reward, but need to be non-stationary.
We devise a non-stationary Q-learning style model-based algorithm that solves delayed execution tasks without resorting to state-augmentation.
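Concretely, in the execution-delay model described above, the action applied at time $t$ was selected $m$ steps earlier, so transitions follow
$$ s_{t+1} \sim P(\cdot \mid s_t, a_{t-m}), $$
which is why deterministic Markov policies over the original state space can remain optimal while depending on the time index (non-stationarity), instead of augmenting the state with the pending action queue $(a_{t-m}, \dots, a_{t-1})$.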
arXiv Detail & Related papers (2021-01-28T13:35:37Z) - Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
We compute unbiased navigation gradients of the value function, which we use as ascent directions to update the policy.
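For context, the ascent step referenced here takes the generic policy-gradient form (a generic statement of the policy gradient theorem, not this paper's specific estimator):
$$ \theta_{k+1} = \theta_k + \eta\, \widehat{\nabla_\theta V}(\theta_k), \qquad \nabla_\theta V^{\pi_\theta}(s_0) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t}\, Q^{\pi_\theta}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big], $$
and the contribution summarized above is a way of forming unbiased estimates of such gradients in continuing (non-episodic) tasks.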
arXiv Detail & Related papers (2020-10-16T15:15:42Z)