Detecting Hidden Triggers: Mapping Non-Markov Reward Functions to Markov
- URL: http://arxiv.org/abs/2401.11325v2
- Date: Mon, 29 Apr 2024 18:26:15 GMT
- Title: Detecting Hidden Triggers: Mapping Non-Markov Reward Functions to Markov
- Authors: Gregory Hyde, Eugene Santos Jr.
- Abstract summary: We propose a framework for mapping non-Markov reward functions into equivalent Markov ones by learning a Reward Machine.
Unlike the general practice of learning Reward Machines, we do not require a set of high-level propositional symbols from which to learn.
We empirically validate our approach by learning black-box non-Markov reward functions in the Officeworld domain.
- Score: 2.486161976966064
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many Reinforcement Learning algorithms assume a Markov reward function to guarantee optimality. However, not all reward functions are known to be Markov. In this paper, we propose a framework for mapping non-Markov reward functions into equivalent Markov ones by learning a Reward Machine - a specialized reward automaton. Unlike the general practice of learning Reward Machines, we do not require a set of high-level propositional symbols from which to learn. Rather, we learn \emph{hidden triggers} directly from data that encode them. We demonstrate the importance of learning Reward Machines versus their Deterministic Finite-State Automata counterparts, for this task, given their ability to model reward dependencies in a single automaton. We formalize this distinction in our learning objective. Our mapping process is constructed as an Integer Linear Programming problem. We prove that our mappings provide consistent expectations for the underlying process. We empirically validate our approach by learning black-box non-Markov Reward functions in the Officeworld Domain. Additionally, we demonstrate the effectiveness of learning dependencies between rewards in a new domain, Breakfastworld.
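To make the paper's central object concrete, below is a minimal sketch of a reward machine: a finite automaton whose transitions consume observed events and emit rewards, so a reward signal that is non-Markov over environment states becomes Markov over the product (environment state, machine state). The class, the event names, and the Officeworld-flavored example are illustrative assumptions, not the authors' implementation; in the paper, both the machine and the triggering events are learned from data via Integer Linear Programming.

```python
# Minimal reward machine sketch (illustrative; not the authors' code).
class RewardMachine:
    def __init__(self, u0, delta_u, delta_r):
        self.u0 = u0            # initial machine state
        self.delta_u = delta_u  # (u, event) -> next machine state
        self.delta_r = delta_r  # (u, event) -> scalar reward
        self.u = u0             # current machine state

    def reset(self):
        self.u = self.u0

    def step(self, event):
        # Emit the reward for the transition taken, then advance the state.
        r = self.delta_r.get((self.u, event), 0.0)
        self.u = self.delta_u.get((self.u, event), self.u)
        return r

# Example: reward 1 only for reaching 'office' AFTER getting 'coffee'.
# This is non-Markov over environment states alone, but Markov over
# (environment state, rm.u): the machine state remembers the trigger.
rm = RewardMachine(u0=0,
                   delta_u={(0, "coffee"): 1, (1, "office"): 2},
                   delta_r={(1, "office"): 1.0})
rm.reset()
assert rm.step("office") == 0.0  # office before coffee: no reward
assert rm.step("coffee") == 0.0  # trigger recorded in rm.u
assert rm.step("office") == 1.0  # dependency satisfied: reward fires
```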
Related papers
- STARC: A General Framework For Quantifying Differences Between Reward
Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
arXiv Detail & Related papers (2023-09-26T20:31:19Z)
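The STARC recipe is concrete enough to sketch for tabular rewards: canonicalize to quotient out potential shaping, normalize to quotient out positive rescaling, then take a metric. The instance below uses value-adjusted canonicalization under a fixed uniform policy and the L2 norm; these particular choices are assumptions for illustration, since the paper defines a whole class of such pseudometrics.

```python
# One illustrative STARC-style pseudometric on tabular reward functions.
import numpy as np

def canonicalize(R, P, gamma):
    """R: (S,A,S') reward array, P: (S,A,S') transition probabilities."""
    S, A, _ = R.shape
    pi = np.full((S, A), 1.0 / A)                # fixed uniform policy
    P_pi = np.einsum("sa,sat->st", pi, P)        # induced state chain
    r_pi = np.einsum("sa,sat,sat->s", pi, P, R)  # expected reward per state
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    # C(R)(s,a,s') = R(s,a,s') - V(s) + gamma * V(s'): adding any potential
    # shaping term to R shifts V so that C(R) is unchanged.
    return R - V[:, None, None] + gamma * V[None, None, :]

def starc_distance(R1, R2, P, gamma=0.9):
    c1, c2 = canonicalize(R1, P, gamma), canonicalize(R2, P, gamma)
    n1 = c1 / max(np.linalg.norm(c1), 1e-12)     # unit norm: scale-invariant
    n2 = c2 / max(np.linalg.norm(c2), 1e-12)
    return np.linalg.norm(n1 - n2)
```

By construction, two rewards that differ only by potential shaping and positive rescaling come out at distance zero, which is exactly the equivalence such metrics are designed to respect.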
- Learning Reward Machines through Preference Queries over Sequences [19.478224060277775]
We contribute REMAP, a novel algorithm for learning reward machines from preferences.
In addition to the proofs of correctness and termination for REMAP, we present empirical evidence measuring correctness.
arXiv Detail & Related papers (2023-08-18T04:49:45Z)
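For flavor, the sketch below shows the kind of preference query such a learner relies on: a teacher who holds the hidden machine compares the cumulative reward of two event sequences. It reuses the illustrative RewardMachine class sketched earlier and omits REMAP's hypothesis-space machinery entirely.

```python
# Illustrative preference oracle over event sequences (not REMAP itself).
def preference_query(rm, seq_a, seq_b):
    def total(seq):
        rm.reset()  # replay the sequence from the initial machine state
        return sum(rm.step(event) for event in seq)
    ta, tb = total(seq_a), total(seq_b)
    return "a" if ta > tb else ("b" if tb > ta else "tie")

# e.g. preference_query(rm, ["coffee", "office"], ["office"]) -> "a"
```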
- Breaking the Curse of Multiagents in a Large State Space: RL in Markov Games with Independent Linear Function Approximation [56.715186432566576]
We propose a new model, independent linear Markov game, for reinforcement learning with a large state space and a large number of agents.
We design new algorithms for learning Markov coarse correlated equilibria (CCE) and Markov correlated equilibria (CE) with sample complexity bounds that scale only with each agent's own function class complexity.
Our algorithms rely on two key technical innovations: (1) using policy replay to tackle the non-stationarity incurred by multiple agents and by function approximation, and (2) separating the learning of Markov equilibria from exploration in the Markov games.
arXiv Detail & Related papers (2023-02-07T18:47:48Z)
- Markov Abstractions for PAC Reinforcement Learning in Non-Markov Decision Processes [90.53326983143644]
We show that Markov abstractions can be learned during reinforcement learning.
We show that our approach has PAC guarantees when the employed algorithms have PAC guarantees.
arXiv Detail & Related papers (2022-04-29T16:53:00Z)
- On the Expressivity of Markov Reward [89.96685777114456]
This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform.
We frame this study around three new abstract notions of "task" that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories.
arXiv Detail & Related papers (2021-11-01T12:12:16Z)
- MURAL: Meta-Learning Uncertainty-Aware Rewards for Outcome-Driven Reinforcement Learning [65.52675802289775]
We show that an uncertainty-aware classifier can solve challenging reinforcement learning problems.
We propose a novel method for computing the normalized maximum likelihood (NML) distribution.
We show that the resulting algorithm has a number of intriguing connections to both count-based exploration methods and prior algorithms for learning reward functions.
arXiv Detail & Related papers (2021-07-15T08:19:57Z)
- Reward Propagation Using Graph Convolutional Networks [61.32891095232801]
We propose a new framework for learning potential functions by leveraging ideas from graph representation learning.
Our approach relies on Graph Convolutional Networks which we use as a key ingredient in combination with the probabilistic inference view of reinforcement learning.
arXiv Detail & Related papers (2020-10-06T04:38:16Z)
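For context on how a learned potential function is actually used: it enters RL through standard potential-based reward shaping, F(s, s') = gamma * Phi(s') - Phi(s), which is known to preserve optimal policies (Ng et al., 1999). The sketch below shows just that plumbing, with the GCN abstracted into an arbitrary callable; the stub name is hypothetical.

```python
# Potential-based shaping with an externally learned potential function.
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Return r + gamma * phi(s_next) - phi(s); policy-invariant shaping."""
    return r + gamma * phi(s_next) - phi(s)

# Usage with a hypothetical per-state potential, e.g. a GCN's node outputs:
# r_shaped = shaped_reward(r, s, s_next, phi=lambda state: gcn_values[state])
```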
- Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning [22.242379207077217]
We show that exposing the reward function's code to the RL agent lets it exploit the function's internal structure to learn optimal policies.
First, we propose reward machines, a type of finite state machine that supports the specification of reward functions.
We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning.
arXiv Detail & Related papers (2020-10-06T00:10:16Z)
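The counterfactual idea is easy to sketch: because the reward machine is given to the agent, a single environment transition can be replayed from every machine state, each replay yielding its own reward and successor machine state. The helper below reuses the illustrative RewardMachine class from earlier and is a sketch of the idea, not the QRM implementation.

```python
# Counterfactual experiences from one transition (illustrative sketch).
def machine_states(rm):
    # Enumerate every machine state mentioned in the transition relation.
    states = {rm.u0}
    for (u, _event), u_next in rm.delta_u.items():
        states.update((u, u_next))
    return states

def counterfactual_experiences(rm, s, a, s_next, event):
    experiences = []
    for u in machine_states(rm):
        r = rm.delta_r.get((u, event), 0.0)
        u_next = rm.delta_u.get((u, event), u)
        # The Q-function for machine state u learns from ((s,u), a, r, (s',u')).
        experiences.append(((s, u), a, r, (s_next, u_next)))
    return experiences
```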
- Online Learning of Non-Markovian Reward Models [2.064612766965483]
We consider a Markov decision process (MDP), equipped with a non-Markovian reward function, that models the dynamics of the environment in which the agent evolves.
While the MDP is known to the agent, the reward function is unknown and must be learned.
We use Angluin's $L*$ active learning algorithm to learn a Mealy machine representing the underlying non-Markovian reward machine.
arXiv Detail & Related papers (2020-09-26T13:54:34Z)
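The Mealy-machine view is what makes L* applicable: rewards are outputs attached to transitions, so answering a membership-style query for an observation trace is just a matter of executing the machine on it. The minimal sketch below is illustrative, not the paper's code.

```python
# Minimal Mealy machine with rewards as transition outputs (illustrative).
class MealyReward:
    def __init__(self, q0, trans, out):
        self.q0 = q0        # initial state
        self.trans = trans  # (q, obs) -> next state
        self.out = out      # (q, obs) -> reward emitted on that transition

    def rewards(self, trace):
        """Membership-style query: the reward sequence for a whole trace."""
        q, rs = self.q0, []
        for obs in trace:
            rs.append(self.out.get((q, obs), 0.0))
            q = self.trans.get((q, obs), q)
        return rs
```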
- Learning Non-Markovian Reward Models in MDPs [0.0]
We show how to formalise the non-Markovian reward function using a Mealy machine.
In our formal setting, we consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves.
While the MDP is known to the agent, the reward function is unknown and must be learned.
arXiv Detail & Related papers (2020-01-25T10:51:42Z)