Quantile Markov Decision Process
- URL: http://arxiv.org/abs/1711.05788v5
- Date: Wed, 15 Oct 2025 17:06:50 GMT
- Title: Quantile Markov Decision Process
- Authors: Xiaocheng Li, Huaiyang Zhong, Margaret L. Brandeau
- Abstract summary: We consider the problem of optimizing the quantiles of the cumulative rewards of a Markov decision process (MDP). We provide analytical results characterizing the optimal QMDP value function and present a dynamic programming-based algorithm to solve for the optimal policy. We illustrate the practical relevance of our model by evaluating it on an HIV treatment initiation problem, where patients aim to balance the potential benefits and risks of the treatment.
- Score: 6.401262949607737
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of a traditional Markov decision process (MDP) is to maximize expected cumulative reward over a defined horizon (possibly infinite). In many applications, however, a decision maker may be interested in optimizing a specific quantile of the cumulative reward instead of its expectation. In this paper we consider the problem of optimizing the quantiles of the cumulative rewards of a Markov decision process (MDP), which we refer to as a quantile Markov decision process (QMDP). We provide analytical results characterizing the optimal QMDP value function and present a dynamic programming-based algorithm to solve for the optimal policy. The algorithm also extends to the MDP problem with a conditional value-at-risk (CVaR) objective. We illustrate the practical relevance of our model by evaluating it on an HIV treatment initiation problem, where patients aim to balance the potential benefits and risks of the treatment.
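To pin down the objective concretely, below is a minimal brute-force sketch on a toy finite-horizon MDP: it enumerates all deterministic time-dependent policies, computes each policy's exact return distribution, and keeps the one with the best τ-quantile. All sizes, dynamics, and rewards here are illustrative assumptions, and the exhaustive enumeration is precisely what the paper's dynamic-programming algorithm avoids.

```python
import itertools
import numpy as np

# Toy finite-horizon MDP (sizes, dynamics, and rewards are illustrative).
S, A, T = 2, 2, 3          # states, actions, horizon
TAU = 0.25                 # quantile level of the cumulative reward to optimize
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
R = rng.normal(size=(S, A))                  # deterministic reward r(s, a)

def return_distribution(policy, s0=0):
    """Exact distribution of the cumulative reward under a deterministic
    time-dependent policy, returned as a dict {return: probability}."""
    dist = {(s0, 0.0): 1.0}                  # (state, return so far) -> prob
    for t in range(T):
        nxt = {}
        for (s, g), p in dist.items():
            a = policy[t][s]
            for s2 in range(S):
                key = (s2, g + R[s, a])
                nxt[key] = nxt.get(key, 0.0) + p * P[s, a, s2]
        dist = nxt
    marginal = {}
    for (_, g), p in dist.items():
        marginal[g] = marginal.get(g, 0.0) + p
    return marginal

def quantile(dist, tau):
    """Smallest return g with P(return <= g) >= tau."""
    acc = 0.0
    for g in sorted(dist):
        acc += dist[g]
        if acc >= tau:
            return g

# Enumerate all (A^S)^T deterministic Markov policies -- feasible only for toys.
best_q, best_pi = -np.inf, None
for pi in itertools.product(itertools.product(range(A), repeat=S), repeat=T):
    q = quantile(return_distribution(pi), TAU)
    if q > best_q:
        best_q, best_pi = q, pi
print(f"best {TAU}-quantile of the cumulative reward: {best_q:.3f}")
```

A CVaR objective can be illustrated in the same way by averaging the tail of the return distribution beyond the τ-quantile instead of taking the quantile itself.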
Related papers
- Quantizer Design for Finite Model Approximations, Model Learning, and Quantized Q-Learning for MDPs with Unbounded Spaces [0.0]
We present refined versions of the upper bounds of [Kara et al., JMLR'23] on finite model approximation errors. We also consider the implications for quantizer design in quantized Q-learning and empirical model learning.
arXiv Detail & Related papers (2025-10-05T20:39:52Z)
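As a minimal illustration of the quantized-MDP idea behind the paper above (not its refined error bounds or quantizer design), the sketch below maps a bounded continuous state space onto a uniform grid so that ordinary tabular Q-learning applies; the dynamics and constants are assumptions.

```python
import numpy as np

# Uniform quantizer: map a continuous state in [LOW, HIGH] to one of N_BINS
# cells, then run standard tabular Q-learning on the quantized states.
LOW, HIGH, N_BINS, N_ACTIONS = -1.0, 1.0, 20, 2
rng = np.random.default_rng(0)

def quantize(x):
    i = int((x - LOW) / (HIGH - LOW) * N_BINS)
    return min(max(i, 0), N_BINS - 1)

def step(x, a):
    """Toy dynamics (assumed): drift left or right with noise; reward favors 0."""
    x2 = np.clip(x + (0.1 if a == 1 else -0.1) + 0.05 * rng.normal(), LOW, HIGH)
    return x2, float(-abs(x2))

Q = np.zeros((N_BINS, N_ACTIONS))
x, alpha, gamma, eps = 0.9, 0.1, 0.95, 0.1
for _ in range(50_000):
    s = quantize(x)
    a = rng.integers(N_ACTIONS) if rng.random() < eps else int(np.argmax(Q[s]))
    x2, r = step(x, a)
    Q[s, a] += alpha * (r + gamma * Q[quantize(x2)].max() - Q[s, a])
    x = x2
print("greedy action per cell:", np.argmax(Q, axis=1))
```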
- Recursive Reward Aggregation [60.51668865089082]
We propose an alternative approach for flexible behavior alignment that eliminates the need to modify the reward function. By introducing an algebraic perspective on Markov decision processes (MDPs), we show that the Bellman equations naturally emerge from the generation and aggregation of rewards. Our approach applies to both deterministic and stochastic settings and seamlessly integrates with value-based and actor-critic algorithms.
arXiv Detail & Related papers (2025-07-11T12:37:20Z)
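In the spirit of that algebraic view, the toy sketch below (deterministic dynamics assumed; this is not the paper's construction) runs backward induction with the aggregation operator as a parameter: plugging in a discounted sum recovers the standard Bellman backup, while plugging in a running max optimizes the best single-step reward along the trajectory.

```python
GAMMA = 0.9

def discounted_sum(r, v):        # standard Bellman aggregation: r + gamma * v
    return r + GAMMA * v

def running_max(r, v):           # aggregate by the best reward encountered
    return max(r, v)

def backward_induction(dynamics, horizon, n_states, n_actions, agg):
    """Finite-horizon optimal values under a pluggable reward aggregator."""
    v = [0.0] * n_states
    for _ in range(horizon):
        v_new = []
        for s in range(n_states):
            q_values = []
            for a in range(n_actions):
                r, s2 = dynamics(s, a)
                q_values.append(agg(r, v[s2]))
            v_new.append(max(q_values))
        v = v_new
    return v

def dynamics(s, a):              # toy deterministic transitions (assumed)
    rewards = [[0.0, 2.0], [5.0, 0.1]]
    return rewards[s][a], (s + a) % 2

print(backward_induction(dynamics, 3, 2, 2, discounted_sum))
print(backward_induction(dynamics, 3, 2, 2, running_max))
```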
- Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis [30.713243690224207]
In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are standard metrics for modeling RL agents' preferences for certain outcomes.
This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees.
arXiv Detail & Related papers (2024-10-31T16:53:20Z)
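The stochastic-approximation core of quantile estimation is compact enough to show directly. The sketch below is a generic quantile-regression update on Monte Carlo returns, not the paper's decomposition or Q-learning algorithm: ascending the pinball-loss subgradient drives the estimate to the τ-quantile of the return distribution.

```python
import numpy as np

# Estimate the TAU-quantile of a fixed policy's return via quantile-regression
# stochastic approximation on Monte Carlo returns (generic technique only).
rng = np.random.default_rng(1)
TAU, ALPHA, EPISODES, HORIZON = 0.1, 0.01, 20_000, 5

def rollout():
    """Toy return: sum of HORIZON noisy rewards under an implicit fixed policy."""
    return rng.normal(loc=1.0, scale=1.0, size=HORIZON).sum()

theta = 0.0                                 # running quantile estimate
for _ in range(EPISODES):
    g = rollout()
    theta += ALPHA * (TAU - (g < theta))    # pinball-loss subgradient step
print(f"estimated {TAU}-quantile: {theta:.2f}")
print(f"empirical check: {np.quantile([rollout() for _ in range(20_000)], TAU):.2f}")
```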
- Pointer Networks with Q-Learning for Combinatorial Optimization [55.2480439325792]
We introduce the Pointer Q-Network (PQN), a hybrid neural architecture that integrates model-free Q-value policy approximation with Pointer Networks (Ptr-Nets).
Our empirical results demonstrate the efficacy of this approach, including tests of the model in unstable environments.
arXiv Detail & Related papers (2023-11-05T12:03:58Z)
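For reference, the pointing mechanism itself is just additive attention over encoder states. The NumPy sketch below shows generic Ptr-Net scoring; how PQN combines this with Q-values is not reproduced here, and all sizes and weights are assumptions.

```python
import numpy as np

# Pointer-network attention: the decoder "points" at one of the encoder
# states via additive attention, u_j = v^T tanh(W1 e_j + W2 d).
rng = np.random.default_rng(5)
D, N = 8, 5                                # hidden size, input sequence length
W1, W2 = rng.normal(size=(D, D)), rng.normal(size=(D, D))
v = rng.normal(size=D)
enc = rng.normal(size=(N, D))              # encoder states e_1 .. e_N
dec = rng.normal(size=D)                   # current decoder state d

scores = np.tanh(enc @ W1.T + dec @ W2.T) @ v
probs = np.exp(scores - scores.max())
probs /= probs.sum()                       # softmax over input positions
print("pointer distribution over inputs:", np.round(probs, 3))
```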
- Beyond Average Return in Markov Decision Processes [49.157108194438635]
We prove that only generalized means can be optimized exactly, even in the more general framework of Distributional Reinforcement Learning (DistRL).
We provide error bounds on the resulting estimators, and discuss the potential of this approach as well as its limitations.
arXiv Detail & Related papers (2023-10-31T08:36:41Z)
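Power means are one concrete family of such generalized means. The sketch below evaluates them on a sampled return distribution; it is illustrative only and does not reproduce the paper's exact family or its DistRL estimators.

```python
import numpy as np

# Power means M_p(X) = (E[X^p])^(1/p) over a positive return sample:
# p -> -inf, 0, 1, +inf recover the min, geometric mean, mean, and max.
returns = np.random.default_rng(2).gamma(shape=2.0, scale=1.0, size=100_000)

def power_mean(x, p):
    return float(np.mean(x ** p) ** (1.0 / p))

for p in (0.5, 1.0, 2.0):                  # monotonically increasing in p
    print(f"p={p:+.1f}: {power_mean(returns, p):.3f}")
```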
- Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
A Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate step-level reward datasets for coding tasks, and we observe similarly improved performance on code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z)
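Schematically, the proposed search is a beam-of-one loop: sample candidate next steps, score each extended chain with the PRM, and keep the best. In the sketch below, `generate_candidates` and `prm_score` are hypothetical stubs standing in for the LLM sampler and the trained reward model; neither is an API from the paper.

```python
import random

# Hypothetical stubs: a real system would call an LLM sampler and a trained
# process-supervised reward model (PRM) here.
def generate_candidates(question, steps, n):
    return [f"step {len(steps)}.{i}" for i in range(n)]

def prm_score(question, steps):
    return random.random()                 # stub for the PRM's learned score

def greedy_prm_search(question, n_candidates=8, max_steps=4):
    """Greedily extend the reasoning chain with the PRM-preferred next step."""
    steps = []
    for _ in range(max_steps):
        candidates = generate_candidates(question, steps, n_candidates)
        best = max(candidates, key=lambda c: prm_score(question, steps + [c]))
        steps.append(best)
    return steps

print(greedy_prm_search("What is 12 * 13?"))
```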
- Differentially Private Deep Q-Learning for Pattern Privacy Preservation in MEC Offloading [76.0572817182483]
Attackers may eavesdrop on the offloading decisions to infer the edge server's (ES's) queue information and users' usage patterns.
We propose an offloading strategy that jointly minimizes the latency, the ES's energy consumption, and the task dropping rate, while preserving pattern privacy (PP).
We develop a Differential Privacy Deep Q-learning based Offloading (DP-DQO) algorithm to solve this problem while addressing the PP issue by injecting noise into the generated offloading decisions.
arXiv Detail & Related papers (2023-02-09T12:50:18Z)
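The noise-injection step can be illustrated with report-noisy-max over Q-values, a standard differential-privacy primitive; the sketch below uses an assumed, conservative noise calibration and is not DP-DQO itself or its privacy accounting.

```python
import numpy as np

rng = np.random.default_rng(3)

def noisy_offload_decision(q_values, epsilon, sensitivity=1.0):
    """Report-noisy-max: perturb each action's Q-value with Laplace noise and
    take the argmax, masking which offloading action was truly preferred."""
    noise = rng.laplace(scale=2.0 * sensitivity / epsilon, size=len(q_values))
    return int(np.argmax(np.asarray(q_values) + noise))

q = [1.2, 0.7, 1.1]            # assumed Q-values for three offloading choices
print([noisy_offload_decision(q, epsilon=0.5) for _ in range(10)])
```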
- Bayesian sequential design of computer experiments for quantile set inversion [0.0]
We consider an unknown multivariate function representing a system, such as a complex numerical simulator. Our objective is to estimate the set of deterministic inputs for which the output falls in a given critical region with probability below a given threshold.
arXiv Detail & Related papers (2022-11-02T10:14:05Z)
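On a toy stochastic simulator, the target set can be estimated by brute-force Monte Carlo, as sketched below; the simulator `f`, threshold, and grid are assumptions, and the paper's point is a far more sample-efficient Bayesian sequential design.

```python
import numpy as np

# Quantile set inversion by brute force: estimate {x : P(f(x, U) > C) <= ALPHA}
# for a toy simulator f with a stochastic input U ~ N(0, 1).
rng = np.random.default_rng(4)
ALPHA, C, N_MC = 0.05, 1.0, 50_000

def f(x, u):                               # toy stochastic simulator (assumed)
    return x + 0.5 * u

u = rng.normal(size=N_MC)
xs = np.linspace(-2.0, 2.0, 41)
in_set = [x for x in xs if np.mean(f(x, u) > C) <= ALPHA]
print(f"estimated quantile set: [{in_set[0]:.2f}, {in_set[-1]:.2f}]")
```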
- Model-Free Reinforcement Learning for Optimal Control of Markov Decision Processes Under Signal Temporal Logic Specifications [7.842869080999489]
We present a model-free reinforcement learning algorithm to find an optimal policy for a finite-horizon Markov decision process.
We illustrate the effectiveness of our approach in the context of robotic motion planning for complex missions under uncertainty and performance objectives.
arXiv Detail & Related papers (2021-09-27T22:44:55Z)
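For context, signal temporal logic comes with quantitative semantics: each trajectory gets a robustness value that reward design can build on. The sketch below evaluates two basic operators; it is a generic STL illustration, not the paper's algorithm.

```python
import numpy as np

def robustness_eventually(signal, c):
    """rho( F (x > c) ) = max_t (x_t - c): positive iff x exceeds c at some t."""
    return float(np.max(np.asarray(signal) - c))

def robustness_always(signal, c):
    """rho( G (x > c) ) = min_t (x_t - c): positive iff x stays above c."""
    return float(np.min(np.asarray(signal) - c))

traj = [0.1, 0.4, 1.3, 0.8]                # toy sampled trajectory (assumed)
print(robustness_eventually(traj, 1.0))    # 0.3: the spec is eventually met
print(robustness_always(traj, 1.0))        # -0.9: the spec is not always met
```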
- Risk-Averse Decision Making Under Uncertainty [18.467950783426947]
A large class of decision-making problems under uncertainty can be described via Markov decision processes (MDPs) or partially observable MDPs (POMDPs).
In this paper, we consider the problem of designing policies for MDPs and POMDPs with objectives and constraints in terms of dynamic coherent risk measures.
arXiv Detail & Related papers (2021-09-09T07:52:35Z)
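Conditional value-at-risk (CVaR) is the workhorse example of such a coherent risk measure, and the same objective the QMDP paper above extends to. Below is a minimal empirical estimator on sampled losses; the estimator choice is illustrative.

```python
import numpy as np

def cvar(losses, alpha=0.95):
    """Empirical CVaR_alpha: the mean of the worst (1 - alpha) tail of losses."""
    losses = np.sort(np.asarray(losses))
    tail = losses[int(np.ceil(alpha * len(losses))):]
    return float(tail.mean())

losses = np.random.default_rng(6).normal(size=100_000)   # toy loss sample
print(f"CVaR_0.95 = {cvar(losses):.3f}")   # ~2.06 for standard normal losses
```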
- Regret Analysis in Deterministic Reinforcement Learning [78.31410227443102]
We study the problem of regret minimization, which is central to the analysis and design of optimal learning algorithms.
We present logarithmic problem-specific regret lower bounds that explicitly depend on the system parameter.
arXiv Detail & Related papers (2021-06-27T23:41:57Z)
- Correct-by-construction reach-avoid control of partially observable linear stochastic systems [7.912008109232803]
We formalize a robust feedback controller for reach-avoid control of discrete-time, linear time-invariant systems.
The problem is to compute a controller that provably satisfies the reach-avoid requirement, which we address via a finite-state abstraction.
arXiv Detail & Related papers (2021-03-03T13:46:52Z)