On the Expressivity of Multidimensional Markov Reward
- URL: http://arxiv.org/abs/2307.12184v1
- Date: Sat, 22 Jul 2023 23:17:44 GMT
- Title: On the Expressivity of Multidimensional Markov Reward
- Authors: Shuwa Miura
- Abstract summary: We consider the expressivity of Markov rewards in sequential decision making under uncertainty.
We show that for every non-degenerate set of deterministic policies, there exists a multidimensional Markov reward function that characterizes it.
- Score: 0.6853165736531939
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the expressivity of Markov rewards in sequential decision making
under uncertainty. We view reward functions in Markov Decision Processes (MDPs)
as a means to characterize desired behaviors of agents. Assuming desired
behaviors are specified as a set of acceptable policies, we investigate if
there exists a scalar or multidimensional Markov reward function that makes the
policies in the set more desirable than the other policies. Our main result
states both necessary and sufficient conditions for the existence of such
reward functions. We also show that for every non-degenerate set of
deterministic policies, there exists a multidimensional Markov reward function
that characterizes it.
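As a rough illustration of the question studied (a toy reading of it, not the paper's formal construction), one can compute a value vector per deterministic policy under a candidate multidimensional reward and check whether every acceptable policy's value vector dominates every unacceptable one. The notion of "more desirable" used below (Pareto dominance at a fixed start state) is an assumption made purely for illustration.

```python
# Toy sketch (an illustration, not the paper's construction): given a small finite
# MDP, a candidate multidimensional reward R(s, a) in R^k, and a set of
# "acceptable" deterministic policies, check whether every acceptable policy's
# start-state value vector Pareto-dominates every unacceptable policy's.
import itertools
import numpy as np

n_states, n_actions, k, gamma = 3, 2, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions, k))                     # vector reward

def value_vector(policy, start=0):
    """Start-state value vector of a deterministic policy (tuple of actions)."""
    P_pi = np.array([P[s, policy[s]] for s in range(n_states)])    # (S, S)
    R_pi = np.array([R[s, policy[s]] for s in range(n_states)])    # (S, k)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)     # (S, k)
    return V[start]

def dominates(u, v):
    return np.all(u >= v) and np.any(u > v)

all_policies = list(itertools.product(range(n_actions), repeat=n_states))
acceptable = {all_policies[0], all_policies[1]}   # hypothetical target set

ok = all(dominates(value_vector(p), value_vector(q))
         for p in acceptable for q in all_policies if q not in acceptable)
print("candidate reward characterizes the set (under this toy notion):", ok)
```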
Related papers
- Non-maximizing policies that fulfill multi-criterion aspirations in expectation [0.7874708385247353]
In dynamic programming and reinforcement learning, the policy for the sequential decision making of an agent is usually determined by expressing the goal as a scalar reward function.
We consider finite acyclic Markov Decision Processes with multiple distinct evaluation metrics, which do not necessarily represent quantities that the user wants to be maximized.
Our algorithm guarantees that this task is fulfilled by using simplices to approximate feasibility sets and propagate aspirations forward while ensuring they remain feasible.
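A minimal sketch of the feasibility idea behind this (a simplification with hypothetical numbers, not the paper's algorithm): an aspiration vector is achievable in expectation exactly when it lies in the convex hull of the achievable expected-outcome vectors, which a small linear program can check.

```python
# Sketch of the feasibility idea behind aspiration propagation (a simplification,
# not the paper's algorithm): an aspiration vector is met in expectation iff
# nonnegative mixing weights summing to 1 reproduce it from the achievable
# expected-outcome vectors.
import numpy as np
from scipy.optimize import linprog

# Hypothetical expected-outcome vectors of a few candidate policies (2 criteria).
outcomes = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.6, 0.6]])
aspiration = np.array([0.5, 0.5])

n = len(outcomes)
# Feasibility LP: find w >= 0 with sum(w) = 1 and outcomes.T @ w = aspiration.
A_eq = np.vstack([outcomes.T, np.ones((1, n))])
b_eq = np.append(aspiration, 1.0)
res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)

if res.success:
    print("aspiration feasible; mixing weights:", np.round(res.x, 3))
else:
    print("aspiration not feasible in expectation")
```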
arXiv Detail & Related papers (2024-08-08T11:41:04Z) - Principal-Agent Reward Shaping in MDPs [50.914110302917756]
Principal-agent problems arise when one party acts on behalf of another, leading to conflicts of interest.
We study a two-player Stackelberg game where the principal and the agent have different reward functions, and the agent chooses an MDP policy for both players.
Our results cover trees and deterministic decision processes with a finite horizon.
arXiv Detail & Related papers (2023-12-30T18:30:44Z) - Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
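For context, quantile-regression-style value-distribution learning typically fits predicted return quantiles with the pinball loss; the sketch below shows that loss in isolation (a generic illustration, not the EQR algorithm itself).

```python
# Minimal sketch of the quantile (pinball) loss used in quantile-regression-style
# value-distribution learning (generic illustration, not the EQR algorithm).
import numpy as np

def pinball_loss(pred_quantiles, target, taus):
    """Average quantile loss of predicted quantiles against a scalar target."""
    diff = target - pred_quantiles                    # positive if under-predicting
    return np.mean(np.where(diff >= 0, taus * diff, (taus - 1.0) * diff))

taus = np.array([0.1, 0.5, 0.9])        # quantile levels to learn
pred = np.array([1.0, 2.0, 3.0])        # current quantile estimates of the return
sampled_return = 2.5                    # e.g. a bootstrapped return sample
print("loss:", pinball_loss(pred, sampled_return, taus))
```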
arXiv Detail & Related papers (2023-08-12T14:59:19Z) - Utility Theory for Sequential Decision Making [20.7262938359876]
We show that memoryless preferences lead to a utility in the form of a per-transition reward and a multiplicative factor on the future return.
We demystify the reward hypothesis that underlies the design of rational agents in reinforcement learning.
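The stated form can be read as a per-step recursion, U_t = r_t + gamma_t * U_{t+1}; the toy sketch below illustrates that reading with hypothetical numbers.

```python
# Toy illustration of a utility built from a per-transition reward plus a
# multiplicative factor on the future return, U_t = r_t + gamma_t * U_{t+1}
# (an interpretation of the stated form, with hypothetical numbers).
def trajectory_utility(rewards, factors):
    """rewards[t], factors[t]: reward and multiplicative factor of step t."""
    utility = 0.0
    for r, g in zip(reversed(rewards), reversed(factors)):
        utility = r + g * utility
    return utility

# A constant factor recovers the ordinary discounted return; varying factors
# give transition-dependent discounting.
print(trajectory_utility([1.0, 2.0, 3.0], [0.9, 0.9, 0.9]))   # 1 + .9*(2 + .9*3)
print(trajectory_utility([1.0, 2.0, 3.0], [0.5, 0.9, 0.99]))  # transition-dependent
```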
arXiv Detail & Related papers (2022-06-27T21:28:35Z) - Efficient Policy Iteration for Robust Markov Decision Processes via
Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $s$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
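As background, a robust Bellman backup takes a worst case over an uncertainty set of transition models; the sketch below uses a small finite candidate set purely for illustration, rather than the $s$-rectangular $L_p$ balls and regularized operators the paper actually analyzes.

```python
# Minimal sketch of a robust Bellman backup: for each state-action pair, take the
# worst case over a small finite set of candidate transition models (a
# simplification for illustration; not the paper's rectangular L_p setting).
import numpy as np

def robust_bellman_backup(V, models, R, gamma=0.95):
    """models: candidate kernels, shape (M, S, A, S'); R: rewards, shape (S, A)."""
    # Expected next value under each candidate model, shape (M, S, A).
    next_values = np.einsum("msap,p->msa", models, V)
    worst_case = next_values.min(axis=0)            # adversarial model choice
    return (R + gamma * worst_case).max(axis=1)     # greedy over actions

S, A, M = 3, 2, 4
rng = np.random.default_rng(1)
models = rng.dirichlet(np.ones(S), size=(M, S, A))
R = rng.normal(size=(S, A))
V = np.zeros(S)
for _ in range(200):                                # robust value iteration
    V = robust_bellman_backup(V, models, R)
print("robust values:", np.round(V, 3))
```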
arXiv Detail & Related papers (2022-05-28T04:05:20Z) - Markov Abstractions for PAC Reinforcement Learning in Non-Markov
Decision Processes [90.53326983143644]
We show that Markov abstractions can be learned during reinforcement learning.
We show that our approach has PAC guarantees when the employed algorithms have PAC guarantees.
arXiv Detail & Related papers (2022-04-29T16:53:00Z) - On the Expressivity of Markov Reward [89.96685777114456]
This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform.
We frame this study around three new abstract notions of "task" that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories.
arXiv Detail & Related papers (2021-11-01T12:12:16Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
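Roughly, such objectives blend a posterior-expected return with a tail-risk term; the sketch below shows one such blend using conditional value at risk (a generic illustration with hypothetical numbers, not the PG-BROIL objective as specified in the paper).

```python
# Rough sketch of a risk-sensitive objective blending posterior-expected return
# with conditional value at risk (CVaR) over reward hypotheses (generic
# illustration with hypothetical numbers, not the PG-BROIL algorithm itself).
import numpy as np

def blended_objective(returns, lam=0.5, alpha=0.95):
    """returns: the policy's return under each posterior reward hypothesis."""
    returns = np.asarray(returns, dtype=float)
    cutoff = np.quantile(returns, 1.0 - alpha)       # worst (1 - alpha) tail
    cvar = returns[returns <= cutoff].mean()
    return (1.0 - lam) * returns.mean() + lam * cvar  # lam=0: risk-neutral

hypothetical_returns = [1.0, 0.8, 1.2, -0.5, 0.9]    # returns under 5 hypotheses
print(blended_objective(hypothetical_returns, lam=0.0))  # risk-neutral
print(blended_objective(hypothetical_returns, lam=1.0))  # risk-averse (CVaR only)
```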
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Identifiability in inverse reinforcement learning [0.0]
Inverse reinforcement learning attempts to reconstruct the reward function in a Markov decision problem.
We provide a resolution to this non-identifiability for problems with entropy regularization.
arXiv Detail & Related papers (2021-06-07T10:35:52Z) - Online Learning of Non-Markovian Reward Models [2.064612766965483]
We consider a Markov decision process (MDP) modelling the dynamics of the environment in which the agent evolves, equipped with a non-Markovian reward function.
While the MDP is known by the agent, the reward function is unknown to the agent and must be learned.
We use Angluin's $L^*$ active learning algorithm to learn a Mealy machine representing the underlying non-Markovian reward machine.
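A Mealy machine in this role emits a reward for each input symbol while updating an internal state; below is a minimal sketch of that data structure with hypothetical labels, not a machine learned by $L^*$.

```python
# Minimal sketch of a Mealy machine used as a non-Markovian reward model: it maps
# each input symbol to an output (a reward) while updating its internal state
# (hypothetical labels and rewards, not a machine learned by L*).
class MealyRewardMachine:
    def __init__(self, transitions, outputs, start):
        self.transitions = transitions   # (state, symbol) -> next state
        self.outputs = outputs           # (state, symbol) -> reward
        self.state = start

    def step(self, symbol):
        reward = self.outputs[(self.state, symbol)]
        self.state = self.transitions[(self.state, symbol)]
        return reward

# Reward 1 only the first time the agent sees "goal"; 0 afterwards.
machine = MealyRewardMachine(
    transitions={("u0", "goal"): "u1", ("u0", "other"): "u0",
                 ("u1", "goal"): "u1", ("u1", "other"): "u1"},
    outputs={("u0", "goal"): 1.0, ("u0", "other"): 0.0,
             ("u1", "goal"): 0.0, ("u1", "other"): 0.0},
    start="u0")
print([machine.step(s) for s in ["other", "goal", "goal", "other"]])  # [0, 1, 0, 0]
```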
arXiv Detail & Related papers (2020-09-26T13:54:34Z) - Learning Non-Markovian Reward Models in MDPs [0.0]
We show how to formalise the non-Markovian reward function using a Mealy machine.
In our formal setting, we consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves.
While the MDP is known by the agent, the reward function is unknown to the agent and must be learned.
arXiv Detail & Related papers (2020-01-25T10:51:42Z)