Beyond Average Return in Markov Decision Processes
- URL: http://arxiv.org/abs/2310.20266v2
- Date: Mon, 19 Feb 2024 10:46:18 GMT
- Title: Beyond Average Return in Markov Decision Processes
- Authors: Alexandre Marthe (ENS de Lyon, UMPA-ENSL), Aur\'elien Garivier
(UMPA-ENSL (MC2)), Claire Vernade
- Abstract summary: We prove that only generalized means can be optimized exactly, even in the more general framework of Distributional Reinforcement Learning (DistRL).
We provide error bounds on the resulting estimators, and discuss the potential of this approach as well as its limitations.
- Score: 49.157108194438635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: What are the functionals of the reward that can be computed and optimized
exactly in Markov Decision Processes?In the finite-horizon, undiscounted
setting, Dynamic Programming (DP) can only handle these operations efficiently
for certain classes of statistics. We summarize the characterization of these
classes for policy evaluation, and give a new answer for the planning problem.
Interestingly, we prove that only generalized means can be optimized exactly,
even in the more general framework of Distributional Reinforcement Learning
(DistRL).DistRL permits, however, to evaluate other functionals approximately.
We provide error bounds on the resulting estimators, and discuss the potential
of this approach as well as its limitations.These results contribute to
advancing the theory of Markov Decision Processes by examining overall
characteristics of the return, and particularly risk-conscious strategies.
Related papers
- Recursive Reward Aggregation [51.552609126905885]
We propose an alternative approach for flexible behavior alignment that eliminates the need to modify the reward function.<n>By introducing an algebraic perspective on Markov decision processes (MDPs), we show that the Bellman equations naturally emerge from the generation and aggregation of rewards.<n>Our approach applies to both deterministic and deterministic settings and seamlessly integrates with value-based and actor-critic algorithms.
arXiv Detail & Related papers (2025-07-11T12:37:20Z) - Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits [58.63897489864948]
Reinforcement learning with outcome-based feedback faces a fundamental challenge.<n>How do we assign credit to the right actions?<n>This paper provides the first comprehensive analysis of this problem in online RL with general function approximation.
arXiv Detail & Related papers (2025-05-26T17:44:08Z) - Counterfactual Strategies for Markov Decision Processes [3.42834279186368]
We introduce counterfactual strategies for Markov Decision Processes (MDPs)<n>During MDP execution, a strategy decides which of the enabled actions to execute next.<n>We identify minimal changes to the initial strategy to reduce that probability below the limit.
arXiv Detail & Related papers (2025-05-14T14:07:27Z) - Fair Resource Allocation in Weakly Coupled Markov Decision Processes [3.824858358548714]
We consider fair resource allocation in sequential decision-making environments modeled as weakly coupled Markov decision processes.
We adopt a fairness definition using the generalized Gini function instead of the traditional utilitarian (total-sum) objective.
arXiv Detail & Related papers (2024-11-14T20:40:55Z) - On Policy Evaluation Algorithms in Distributional Reinforcement Learning [0.0]
We introduce a novel class of algorithms to efficiently approximate the unknown return distributions in policy evaluation problems from distributional reinforcement learning (DRL)
For a plain instance of our proposed class of algorithms we prove error bounds, both within Wasserstein and Kolmogorov--Smirnov distances.
For return distributions having probability density functions the algorithms yield approximations for these densities; error bounds are given within supremum norm.
arXiv Detail & Related papers (2024-07-19T10:06:01Z) - Risk-sensitive Markov Decision Process and Learning under General
Utility Functions [3.6260136172126667]
Reinforcement Learning (RL) has gained substantial attention across diverse application domains and theoretical investigations.
We propose a modified value algorithm that employs an epsilon-covering over the space of cumulative reward.
In the absence of a simulator, our algorithm, designed with an upper-confidence-bound exploration approach, identifies a near-optimal policy.
arXiv Detail & Related papers (2023-11-22T18:50:06Z) - Provably Efficient UCB-type Algorithms For Learning Predictive State
Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs)
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z) - GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP,
and Beyond [101.5329678997916]
We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making.
We propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation.
We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR.
arXiv Detail & Related papers (2022-11-03T16:42:40Z) - Offline Policy Optimization with Eligible Actions [34.4530766779594]
offline policy optimization could have a large impact on many real-world decision-making problems.
Importance sampling and its variants are a commonly used type of estimator in offline policy evaluation.
We propose an algorithm to avoid this overfitting through a new per-state-neighborhood normalization constraint.
arXiv Detail & Related papers (2022-07-01T19:18:15Z) - Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
We develop a new framework for off-policy evaluation with a textitpolicy-dependent linear optimization response.
We construct unbiased estimators for the policy-dependent estimand by a perturbation method.
We provide a general algorithm for optimizing causal interventions.
arXiv Detail & Related papers (2022-02-25T20:25:37Z) - Identification of Unexpected Decisions in Partially Observable
Monte-Carlo Planning: a Rule-Based Approach [78.05638156687343]
We propose a methodology for analyzing POMCP policies by inspecting their traces.
The proposed method explores local properties of policy behavior to identify unexpected decisions.
We evaluate our approach on Tiger, a standard benchmark for POMDPs, and a real-world problem related to mobile robot navigation.
arXiv Detail & Related papers (2020-12-23T15:09:28Z) - Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z) - Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL)
We devise an optimization method, called Kalman Optimization for Value Approximation (KOVA)
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.