Efficient Computation of Blackwell Optimal Policies using Rational Functions
- URL: http://arxiv.org/abs/2508.18252v1
- Date: Mon, 25 Aug 2025 17:41:30 GMT
- Title: Efficient Computation of Blackwell Optimal Policies using Rational Functions
- Authors: Dibyangshu Mukherjee, Shivaram Kalyanakrishnan
- Abstract summary: Markov Decision Problems (MDPs) provide a foundational framework for modelling sequential decision-making across diverse domains. Discounted optimality may overly prioritise short-term rewards, while average optimality relies on strong structural assumptions. Blackwell optimality addresses these challenges, offering a robust and comprehensive criterion that ensures optimality under both discounted and average reward frameworks.
- Score: 3.0529230554642752
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Markov Decision Problems (MDPs) provide a foundational framework for modelling sequential decision-making across diverse domains, guided by optimality criteria such as discounted and average rewards. However, these criteria have inherent limitations: discounted optimality may overly prioritise short-term rewards, while average optimality relies on strong structural assumptions. Blackwell optimality addresses these challenges, offering a robust and comprehensive criterion that ensures optimality under both discounted and average reward frameworks. Despite its theoretical appeal, existing algorithms for computing Blackwell Optimal (BO) policies are computationally expensive or hard to implement. In this paper we describe procedures for computing BO policies using an ordering of rational functions in the vicinity of $1$. We adapt state-of-the-art algorithms for deterministic and general MDPs, replacing numerical evaluations with symbolic operations on rational functions to derive bounds independent of bit complexity. For deterministic MDPs, we give the first strongly polynomial-time algorithms for computing BO policies, and for general MDPs we obtain the first subexponential-time algorithm. We further generalise several policy iteration algorithms, extending the best known upper bounds from the discounted to the Blackwell criterion.
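The key primitive the abstract alludes to is an ordering of rational functions of the discount factor $\gamma$ in a left neighbourhood of $1$, carried out symbolically so that the resulting bounds do not depend on bit complexity. Below is a minimal Python sketch of one way such a comparison can be done exactly; it illustrates the ordering primitive under the assumption that both denominators are positive near $1$ (as for the factors arising from $(I - \gamma P)^{-1}$), and is not the authors' actual procedure. All names are mine.

```python
from fractions import Fraction
from math import comb

def poly_mul(p, q):
    """Multiply polynomials given as ascending-power coefficient lists."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += Fraction(a) * Fraction(b)
    return out

def sign_near_one(p):
    """Sign of p(gamma) for gamma slightly below 1, computed exactly.

    Expanding around 1, p(1 + t) = sum_k c_k t^k with c_k = sum_i p[i] * C(i, k).
    As t -> 0^- the lowest-order nonzero c_k decides; its sign flips
    when k is odd because t is negative.
    """
    for k in range(len(p)):
        c_k = sum(Fraction(p[i]) * comb(i, k) for i in range(k, len(p)))
        if c_k != 0:
            sign = 1 if c_k > 0 else -1
            return sign if k % 2 == 0 else -sign
    return 0  # identically zero polynomial

def compare_near_one(num1, den1, num2, den2):
    """Order num1/den1 against num2/den2 as gamma -> 1^-: returns +1, -1, or 0.

    Assumes den1 and den2 are positive in a left neighbourhood of 1,
    so cross-multiplying preserves the ordering.
    """
    lhs, rhs = poly_mul(num1, den2), poly_mul(num2, den1)
    width = max(len(lhs), len(rhs))
    lhs += [Fraction(0)] * (width - len(lhs))
    rhs += [Fraction(0)] * (width - len(rhs))
    return sign_near_one([a - b for a, b in zip(lhs, rhs)])
```

For instance, with $V_1(\gamma) = 1/(1-\gamma)$ (reward 1 every step) and $V_2(\gamma) = 2\gamma/(1-\gamma)$ (reward 2 from the second step on), `compare_near_one([1], [1, -1], [0, 2], [1, -1])` returns `-1`: the second policy dominates for all $\gamma$ sufficiently close to $1$, even though the first is better for $\gamma < 1/2$. This is exactly the kind of ordering a Blackwell optimal policy must win.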
Related papers
- Principled Algorithms for Optimizing Generalized Metrics in Binary Classification [53.604375124674796]
We introduce principled algorithms for optimizing generalized metrics, supported by $H$-consistency and finite-sample generalization bounds. Our approach reformulates metric optimization as a generalized cost-sensitive learning problem. We develop new algorithms, METRO, with strong theoretical performance guarantees.
arXiv Detail & Related papers (2025-12-29T01:33:42Z)
- Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning [66.4260157478436]
We study reinforcement learning in the policy learning setting. The goal is to find a policy whose performance is competitive with the best policy in a given class of interest.
arXiv Detail & Related papers (2025-07-06T14:40:05Z)
- A Principled Approach to Randomized Selection under Uncertainty: Applications to Peer Review and Grant Funding [68.43987626137512]
We propose a principled framework for randomized decision-making based on interval estimates of the quality of each item. We introduce MERIT, an optimization-based method that maximizes the worst-case expected number of top candidates selected. We prove that MERIT satisfies desirable axiomatic properties not guaranteed by existing approaches.
arXiv Detail & Related papers (2025-06-23T19:59:30Z)
- Best-of-Both-Worlds Policy Optimization for CMDPs with Bandit Feedback [34.7178680288326]
Stradi et al. (2024) proposed the first best-of-both-worlds algorithm for constrained Markov decision processes. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.
arXiv Detail & Related papers (2024-10-03T07:44:40Z)
- e-COP: Episodic Constrained Optimization of Policies [12.854752753529151]
We present the first policy optimization algorithm for constrained Reinforcement Learning (RL) in episodic (finite-horizon) settings. We show that our algorithm has similar or better performance than SoTA (non-episodic) algorithms adapted for the episodic setting.
arXiv Detail & Related papers (2024-06-13T20:12:09Z)
- Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs [17.62509045102346]
This paper considers the best policy identification problem in online Constrained Markov Decision Processes (CMDPs).
We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with a high probability.
Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy.
arXiv Detail & Related papers (2023-09-27T04:33:09Z)
- Policy Gradient Algorithms for Robust MDPs with Non-Rectangular Uncertainty Sets [10.560809881699468]
We propose policy gradient algorithms for robust infinite-horizon Markov decision processes (MDPs) with non-rectangular uncertainty sets. The corresponding robust MDPs cannot be solved with dynamic programming techniques and are in fact provably intractable. We thus present the first complete solution scheme for robust MDPs with non-rectangular uncertainty sets offering global optimality guarantees.
arXiv Detail & Related papers (2023-05-30T13:02:25Z)
- ACPO: A Policy Optimization Algorithm for Average MDPs with Constraints [36.16736392624796]
We introduce a new policy optimization algorithm with function approximation for constrained MDPs with the average criterion.
We develop basic sensitivity theory for average CMDPs, and then use the corresponding bounds in the design of the algorithm.
We show its superior empirical performance when compared to other state-of-the-art algorithms adapted for the ACMDPs.
arXiv Detail & Related papers (2023-02-02T00:23:36Z)
- Efficient Policy Iteration for Robust Markov Decision Processes via Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $s$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
arXiv Detail & Related papers (2022-05-28T04:05:20Z)
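A concrete reading of the entry above: for reward uncertainty, Hölder duality turns an $s$-rectangular $L_p$ ball of radius $\alpha$ into an $L_q$ penalty on the policy with $q = p/(p-1)$, since $\min_{\|d\|_p \le \alpha} \sum_a \pi(a|s) d(a) = -\alpha \|\pi(\cdot|s)\|_q$. The sketch below shows one regularized evaluation sweep in this spirit; it covers the reward-uncertainty case only (the paper also treats the greedy policy and the full robust Bellman operators), and the function name and array shapes are my assumptions, not the paper's code.

```python
import numpy as np

def robust_eval_sweep(v, pi, r, P, gamma, alpha, p=2.0):
    """One policy-evaluation sweep under s-rectangular L_p reward uncertainty.

    Hoelder duality gives min_{||d||_p <= alpha} sum_a pi(a|s) d(a)
    = -alpha * ||pi(.|s)||_q with q = p / (p - 1) (assumes p > 1), so the
    robust update is the nominal update minus a norm penalty on the policy.

    Assumed shapes: v (S,), pi (S, A), r (S, A), P (S, A, S).
    """
    q = p / (p - 1.0)
    qsa = r + gamma * P @ v                        # nominal Q-values, (S, A)
    penalty = alpha * np.linalg.norm(pi, ord=q, axis=1)
    return (pi * qsa).sum(axis=1) - penalty        # new value vector, (S,)
```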
- Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs [113.8752163061151]
We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs). We propose the periodically restarted optimistic policy optimization algorithm (PROPO). PROPO features two mechanisms: sliding-window-based policy evaluation and periodic-restart-based policy improvement.
arXiv Detail & Related papers (2021-10-18T02:33:20Z)
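To make the two PROPO mechanisms above concrete, here is a toy sliding-window estimator with a restart hook; it is a schematic illustration of the idea, not the linear-kernel algorithm from the paper, and the class and method names are invented.

```python
from collections import deque

class SlidingWindowValueEstimator:
    """Toy sliding-window Monte-Carlo value estimate for a single state.

    Non-stationarity is handled by averaging only the W most recent
    returns (sliding-window evaluation); restart() discards all history,
    after which a fresh policy would be derived from new estimates
    (periodic-restart improvement).
    """
    def __init__(self, window: int):
        self.returns = deque(maxlen=window)  # returns older than W fall out

    def update(self, episode_return: float) -> None:
        self.returns.append(episode_return)

    def estimate(self) -> float:
        return sum(self.returns) / len(self.returns) if self.returns else 0.0

    def restart(self) -> None:
        self.returns.clear()  # scheduled restart: forget stale data
```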
- On the Optimality of Batch Policy Optimization Algorithms [106.89498352537682]
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment.
We show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral.
We introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.
arXiv Detail & Related papers (2021-04-06T05:23:20Z)
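The confidence-adjusted index family in the last entry has a simple generic shape: score each candidate by its value estimate plus a signed multiple of its confidence width, then pick the highest score. A hedged sketch (names mine, not the paper's notation):

```python
import numpy as np

def confidence_adjusted_index(means, widths, beta):
    """Index = value estimate + beta * confidence width.

    beta > 0 gives an optimistic (UCB-style) rule, beta < 0 a pessimistic
    (LCB-style) rule, beta = 0 a neutral greedy rule; the entry above states
    that any such rule is minimax optimal in the batch setting.
    """
    return np.asarray(means) + beta * np.asarray(widths)

# Usage: pick the candidate with the largest adjusted index, e.g.
# best = int(np.argmax(confidence_adjusted_index(means, widths, beta=-1.0)))
```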