Beyond Scalar Rewards: An Axiomatic Framework for Lexicographic MDPs
- URL: http://arxiv.org/abs/2505.12049v1
- Date: Sat, 17 May 2025 15:23:58 GMT
- Title: Beyond Scalar Rewards: An Axiomatic Framework for Lexicographic MDPs
- Authors: Mehran Shakerinava, Siamak Ravanbakhsh, Adam Oberman,
- Abstract summary: Hausner's foundational work showed that dropping the continuity axiom leads to a generalization of expected utility theory. We provide a full characterization of such reward functions, as well as the general d-dimensional case, in Markov Decision Processes (MDPs) under a memorylessness assumption on preferences. We show that optimal policies in this setting retain many desirable properties of their scalar-reward counterparts, while in the Constrained MDP setting -- another common multiobjective setting -- they do not.
- Score: 18.48866194756127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has formalized the reward hypothesis through the lens of expected utility theory, by interpreting reward as utility. Hausner's foundational work showed that dropping the continuity axiom leads to a generalization of expected utility theory where utilities are lexicographically ordered vectors of arbitrary dimension. In this paper, we extend this result by identifying a simple and practical condition under which preferences cannot be represented by scalar rewards, necessitating a 2-dimensional reward function. We provide a full characterization of such reward functions, as well as the general d-dimensional case, in Markov Decision Processes (MDPs) under a memorylessness assumption on preferences. Furthermore, we show that optimal policies in this setting retain many desirable properties of their scalar-reward counterparts, while in the Constrained MDP (CMDP) setting -- another common multiobjective setting -- they do not.
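To make the lexicographic ordering concrete, the following is a minimal illustrative sketch rather than the paper's construction: it brute-forces deterministic stationary policies in a small randomly generated MDP with 2-dimensional rewards, evaluates each policy's 2-dimensional discounted return, and compares returns lexicographically, so the second reward dimension matters only to break ties in the first. The toy MDP, discount factor, and all names are assumptions made for illustration.
```python
# Illustrative toy (not the paper's algorithm): lexicographic comparison of
# 2-dimensional discounted returns across deterministic stationary policies.
# The MDP, discount factor, and tolerance below are arbitrary assumptions.
import itertools
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] -> distribution over next states
R = rng.normal(size=(n_states, n_actions, 2))                     # R[s, a] -> 2-d reward vector

def evaluate(policy):
    """Exact 2-d discounted value function of a deterministic stationary policy."""
    P_pi = P[np.arange(n_states), policy]   # (S, S) transition matrix under the policy
    R_pi = R[np.arange(n_states), policy]   # (S, 2) reward vectors under the policy
    # V = R_pi + gamma * P_pi @ V  <=>  (I - gamma * P_pi) V = R_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

def lex_better(u, v, tol=1e-9):
    """True if return vector u is lexicographically preferred to v."""
    for a, b in zip(u, v):
        if a > b + tol:
            return True
        if a < b - tol:
            return False
    return False  # equal up to tolerance

start = 0
best_policy, best_value = None, None
for policy in itertools.product(range(n_actions), repeat=n_states):
    value = evaluate(np.array(policy))[start]
    if best_value is None or lex_better(value, best_value):
        best_policy, best_value = policy, value

print("lexicographically optimal policy:", best_policy)
print("its 2-d return at the start state:", best_value)
```
Enumerating all deterministic policies is only feasible for toy problems; the point of the sketch is the comparison rule, not the search.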
Related papers
- Dynamic and Generalizable Process Reward Modeling [74.36829922727026]
We propose Dynamic and Generalizable Process Reward Modeling (DG-PRM), which features a reward tree to capture and store fine-grained, multi-dimensional reward criteria. Experimental results show that DG-PRM achieves stunning performance on prevailing benchmarks, significantly boosting model performance across tasks with dense rewards.
arXiv Detail & Related papers (2025-07-23T18:17:22Z) - Recursive Reward Aggregation [51.552609126905885]
We propose an alternative approach for flexible behavior alignment that eliminates the need to modify the reward function. By introducing an algebraic perspective on Markov decision processes (MDPs), we show that the Bellman equations naturally emerge from the generation and aggregation of rewards. Our approach applies to both deterministic and stochastic settings and seamlessly integrates with value-based and actor-critic algorithms.
arXiv Detail & Related papers (2025-07-11T12:37:20Z) - Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits [58.63897489864948]
Reinforcement learning with outcome-based feedback faces a fundamental challenge: how do we assign credit to the right actions? This paper provides the first comprehensive analysis of this problem in online RL with general function approximation.
arXiv Detail & Related papers (2025-05-26T17:44:08Z) - Understanding Reward Ambiguity Through Optimal Transport Theory in
Inverse Reinforcement Learning [4.8951183832371]
Inverse reinforcement learning (IRL) aims to infer underlying reward functions from observed expert behaviors.
Current methods often face challenges with high-dimensional problems and lack a geometric foundation.
This paper harnesses the optimal transport (OT) theory to provide a fresh perspective on these challenges.
arXiv Detail & Related papers (2023-10-18T15:42:53Z) - Submodular Reinforcement Learning [38.40138241424851]
In reinforcement learning (RL), rewards of states are typically considered additive and, following the Markov assumption, independent of states visited previously.
In many important applications, such as coverage control, experiment design and informative path planning, rewards naturally have diminishing returns, i.e., their value decreases in light of similar states visited previously.
We propose submodular RL (SubRL), a paradigm which seeks to optimize more general, non-additive (and history-dependent) rewards modelled via submodular set functions which capture diminishing returns.
arXiv Detail & Related papers (2023-07-25T09:46:02Z) - Utility Theory for Sequential Decision Making [20.7262938359876]
We show that memoryless preferences lead to a utility in the form of a per-transition reward and a multiplicative factor on the future return; a worked form of this recursion is sketched after this list.
We demystify the reward hypothesis that underlies the design of rational agents in reinforcement learning.
arXiv Detail & Related papers (2022-06-27T21:28:35Z) - Reinforcement Learning in Linear MDPs: Constant Regret and Representation Selection [136.4014229319618]
We study the role of the representation of state-action value functions in regret minimization in finite-horizon Markov Decision Processes (MDPs) with linear structure.
We first derive a necessary condition on the representation, called universally spanning optimal features (UNISOFT), to achieve constant regret in any MDP with linear reward function.
arXiv Detail & Related papers (2021-10-27T22:07:08Z) - Probabilistic Entity Representation Model for Chain Reasoning over Knowledge Graphs [18.92547855877845]
We propose a Probabilistic Entity Representation Model (PERM) for logical reasoning over Knowledge Graphs.
PERM encodes entities as a multivariate Gaussian density with mean and covariance parameters to capture semantic position and a smooth decision boundary.
We demonstrate PERM's competence on a COVID-19 drug-repurposing case study and show that our proposed work is able to recommend drugs with substantially better F1 than current methods.
arXiv Detail & Related papers (2021-10-26T09:26:10Z) - Self-Supervised Online Reward Shaping in Sparse-Reward Environments [36.01839934355542]
We propose a novel reinforcement learning framework that performs self-supervised online reward shaping.
The proposed framework alternates between updating a policy and inferring a reward function.
Experimental results on several sparse-reward environments demonstrate that the proposed algorithm is significantly more sample efficient than the state-of-the-art baseline.
arXiv Detail & Related papers (2021-03-08T03:28:04Z) - Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how, under a more standard notion of low inherent Bellman error, typically employed in least-squares value-iteration-style algorithms, we can provide strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z) - A maximum-entropy approach to off-policy evaluation in average-reward MDPs [54.967872716145656]
This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs).
We provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases.
We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning.
arXiv Detail & Related papers (2020-06-17T18:13:37Z) - Maximizing Information Gain in Partially Observable Environments via
Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
arXiv Detail & Related papers (2020-05-11T08:13:49Z) - Finding Fair and Efficient Allocations When Valuations Don't Add Up [25.962505544590947]
We show that when agent valuations are matroid rank functions, a socially optimal (i.e., utilitarian social welfare-maximizing) allocation that achieves envy-freeness up to one item (EF1) exists and is computationally tractable.
This is the first valuation function class not subsumed by additive valuations for which it has been established that an allocation maximizing Nash welfare is EF1.
arXiv Detail & Related papers (2020-03-16T07:42:27Z) - Target-Embedding Autoencoders for Supervised Representation Learning [111.07204912245841]
This paper analyzes a framework for improving generalization in a purely supervised setting, where the target space is high-dimensional.
We motivate and formalize the general framework of target-embedding autoencoders (TEA) for supervised prediction, learning intermediate latent representations jointly optimized to be both predictable from features and predictive of targets.
arXiv Detail & Related papers (2020-01-23T02:37:10Z)
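The recursive utility form mentioned in the "Utility Theory for Sequential Decision Making" entry above can be written out as a one-line recursion. The notation is my own reconstruction from the abstract's wording, not taken from the paper: a per-transition reward plus a transition-dependent multiplicative factor on the utility of the remaining trajectory,
$$U(s_0, a_0, s_1, a_1, s_2, \dots) \;=\; r(s_0, a_0, s_1) + \gamma(s_0, a_0, s_1)\, U(s_1, a_1, s_2, \dots).$$
Taking $\gamma(s, a, s')$ to be a constant $\gamma \in [0, 1)$ recovers the standard discounted return, the scalar-reward setting that the main paper above extends to lexicographically ordered vector rewards.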
This list is automatically generated from the titles and abstracts of the papers on this site.