Towards Blackwell Optimality: Bellman Optimality Is All You Can Get
- URL: http://arxiv.org/abs/2510.13476v1
- Date: Wed, 15 Oct 2025 12:22:25 GMT
- Title: Towards Blackwell Optimality: Bellman Optimality Is All You Can Get
- Authors: Victor Boone, Adrienne Tuynman,
- Abstract summary: We investigate the problem of identifying policies of optimality orders in Markov Decision Processes.<n>For each order, we construct a learning algorithm with vanishing probability of error.<n>We characterize the class of MDPs for which identification algorithms can stop in finite time.
- Score: 7.534196213324318
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although average gain optimality is a commonly adopted performance measure in Markov Decision Processes (MDPs), it is often too asymptotic. Further incorporating measures of immediate losses leads to the hierarchy of bias optimalities, all the way up to Blackwell optimality. In this paper, we investigate the problem of identifying policies of such optimality orders. To that end, for each order, we construct a learning algorithm with vanishing probability of error. Furthermore, we characterize the class of MDPs for which identification algorithms can stop in finite time. That class corresponds to the MDPs with a unique Bellman optimal policy, and does not depend on the optimality order considered. Lastly, we provide a tractable stopping rule that when coupled to our learning algorithm triggers in finite time whenever it is possible to do so.
Related papers
- Efficient Computation of Blackwell Optimal Policies using Rational Functions [3.0529230554642752]
Decision Problems (MDPs) provide a foundational framework for modelling sequential decision-making across diverse domains.<n> discounted optimality may overly prioritise short-term rewards, while average optimality relies on strong structural assumptions.<n>Blackwell optimality addresses these challenges, offering a robust and comprehensive criterion that ensures optimality under both discounted and average reward frameworks.
arXiv Detail & Related papers (2025-08-25T17:41:30Z) - Anytime-Constrained Reinforcement Learning [6.981971551979697]
We introduce and study constrained Markov Decision Processes (cMDPs) with anytime constraints.
We show that there exist optimal deterministic policies augmented with cumulative costs.
We show that computing non-trivial approximately optimal policies is NP-hard in general.
arXiv Detail & Related papers (2023-11-09T16:51:26Z) - Policy Gradient Algorithms for Robust MDPs with Non-Rectangular Uncertainty Sets [10.560809881699468]
We propose policy gradient algorithms for robust infinite-horizon Markov decision processes (MDPs) with non-rectangular uncertainty sets.<n>The corresponding robust MDPs cannot be solved with dynamic programming techniques and are in fact provably intractable.<n>We thus present the first complete solution scheme for robust MDPs with non-rectangular uncertainty sets offering global optimality guarantees.
arXiv Detail & Related papers (2023-05-30T13:02:25Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.<n>Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.<n>We propose Pessimistic Policy Learning (PPL), a new algorithm that optimize lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds
for Episodic Reinforcement Learning [50.44564503645015]
We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes.
We prove tighter upper regret bounds for optimistic algorithms and accompany them with new information-theoretic lower bounds for a large class of MDPs.
arXiv Detail & Related papers (2021-07-02T20:36:05Z) - On the Optimality of Batch Policy Optimization Algorithms [106.89498352537682]
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment.
We show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral.
We introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.
arXiv Detail & Related papers (2021-04-06T05:23:20Z) - Online Model Selection for Reinforcement Learning with Function
Approximation [50.008542459050155]
We present a meta-algorithm that adapts to the optimal complexity with $tildeO(L5/6 T2/3)$ regret.
We also show that the meta-algorithm automatically admits significantly improved instance-dependent regret bounds.
arXiv Detail & Related papers (2020-11-19T10:00:54Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Stochastic Optimization Forests [60.523606291705214]
We show how to train forest decision policies by growing trees that choose splits to directly optimize the downstream decision quality, rather than splitting to improve prediction accuracy as in the standard random forest algorithm.
We show that our approximate splitting criteria can reduce running time hundredfold, while achieving performance close to forest algorithms that exactly re-optimize for every candidate split.
arXiv Detail & Related papers (2020-08-17T16:56:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.