Learning Reward Machines through Preference Queries over Sequences
- URL: http://arxiv.org/abs/2308.09301v1
- Date: Fri, 18 Aug 2023 04:49:45 GMT
- Title: Learning Reward Machines through Preference Queries over Sequences
- Authors: Eric Hsiung, Joydeep Biswas, Swarat Chaudhuri
- Abstract summary: We contribute REMAP, a novel algorithm for learning reward machines from preferences.
In addition to the proofs of correctness and termination for REMAP, we present empirical evidence measuring correctness.
- Score: 19.478224060277775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward machines have shown great promise at capturing non-Markovian reward
functions for learning tasks that involve complex action sequencing. However,
no algorithm currently exists for learning reward machines with realistic weak
feedback in the form of preferences. We contribute REMAP, a novel algorithm for
learning reward machines from preferences, with correctness and termination
guarantees. REMAP introduces preference queries in place of membership queries
in the L* algorithm, and leverages a symbolic observation table along with
unification and constraint solving to narrow the hypothesis reward machine
search space. In addition to the proofs of correctness and termination for
REMAP, we present empirical evidence measuring correctness: how frequently
the learned reward machine is isomorphic to the ground truth under a
consistent yet inexact teacher, and the regret between the ground truth and
learned reward machines.
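As a rough illustration of the mechanism described in the abstract, here is a minimal, hypothetical Python sketch of an L*-style symbolic observation table driven by preference queries. Everything below (the `PreferenceTeacher` interface, the variable-naming scheme, the pairwise querying strategy) is an assumption for exposition, not the authors' implementation; REMAP additionally feeds such constraints to a constraint solver and uses unification to narrow the hypothesis reward machine space.

```python
# A minimal sketch (not the authors' implementation) of the core REMAP idea:
# an L*-style observation table whose cells hold symbolic reward variables,
# constrained by answers to preference queries over label sequences.
from itertools import combinations

class PreferenceTeacher:
    """Hypothetical teacher: compares the ground-truth return of two sequences."""
    def __init__(self, true_return):
        self.true_return = true_return  # maps a label sequence to a return

    def prefer(self, seq_a, seq_b):
        ra, rb = self.true_return(seq_a), self.true_return(seq_b)
        return "a" if ra > rb else "b" if rb > ra else "tie"

def build_symbolic_table(prefixes, suffixes):
    """Each cell holds a symbolic variable instead of a concrete reward value."""
    return {(p, s): f"v_{'.'.join(p + s) or 'eps'}" for p in prefixes for s in suffixes}

def collect_constraints(table, teacher):
    """Turn preference answers into ordering constraints over the symbolic
    variables. A constraint solver would then unify variables ('tie' answers)
    and prune hypothesis reward machines inconsistent with the orderings."""
    constraints = []
    for ((pa, sa), va), ((pb, sb), vb) in combinations(list(table.items()), 2):
        answer = teacher.prefer(pa + sa, pb + sb)
        if answer == "a":
            constraints.append((va, ">", vb))
        elif answer == "b":
            constraints.append((vb, ">", va))
        else:
            constraints.append((va, "==", vb))  # unification candidate
    return constraints

# Toy usage: sequences over labels {"a", "b"}; longer runs of "a" pay more.
teacher = PreferenceTeacher(lambda seq: sum(1 for x in seq if x == "a"))
table = build_symbolic_table(prefixes=[(), ("a",)], suffixes=[(), ("b",)])
print(collect_constraints(table, teacher))
```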
Related papers
- Detecting Hidden Triggers: Mapping Non-Markov Reward Functions to Markov [2.486161976966064]
We propose a framework for mapping non-Markov reward functions into equivalent Markov ones by learning a Reward Machine.
Unlike the general practice of learning Reward Machines, we do not require a set of high-level propositional symbols from which to learn.
We empirically validate our approach by learning black-box non-Markov Reward functions in the Officeworld Domain.
arXiv Detail & Related papers (2024-01-20T21:09:27Z) - Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z) - STARC: A General Framework For Quantifying Differences Between Reward Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
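A simplified sketch below illustrates the general recipe behind such pseudometrics on a toy finite MDP: canonicalise away potential-based reward shaping, normalise scale, then compare. Rewards are treated as vectors over (state, action, next state); the least-squares canonicalisation and the specific norm used here are illustrative assumptions and differ from the paper's exact STARC definitions.

```python
# A simplified, STARC-style pseudometric sketch (not the paper's exact metric).
import numpy as np

def canonicalize(r, gamma, n_states, n_actions):
    """Project out potential-based shaping, gamma*phi(s') - phi(s), by least
    squares, leaving a shaping-invariant representative of the reward."""
    r = r.reshape(n_states, n_actions, n_states)
    cols = []
    for k in range(n_states):
        shaping = np.zeros_like(r)
        shaping[k, :, :] -= 1.0    # -phi(s) term
        shaping[:, :, k] += gamma  # +gamma*phi(s') term
        cols.append(shaping.ravel())
    A = np.stack(cols, axis=1)
    phi, *_ = np.linalg.lstsq(A, r.ravel(), rcond=None)
    return r.ravel() - A @ phi     # residual orthogonal to all shaping terms

def starc_like_distance(r1, r2, gamma=0.9, n_states=3, n_actions=2):
    c1, c2 = (canonicalize(r, gamma, n_states, n_actions) for r in (r1, r2))
    n1, n2 = np.linalg.norm(c1), np.linalg.norm(c2)
    if n1 == 0 or n2 == 0:         # trivial rewards compare as equal iff both trivial
        return 0.0 if n1 == n2 else 1.0
    return float(np.linalg.norm(c1 / n1 - c2 / n2))

rng = np.random.default_rng(0)
r = rng.normal(size=3 * 2 * 3)
phi = rng.normal(size=3)
# A shaped variant of r: its distance to r should be ~0, since potential-based
# shaping is exactly what the canonicalisation removes.
shaped = (r.reshape(3, 2, 3) + 0.9 * phi[None, None, :] - phi[:, None, None]).ravel()
print(starc_like_distance(r, shaped))            # ~0
print(starc_like_distance(r, rng.normal(size=18)))  # > 0 in general
```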
arXiv Detail & Related papers (2023-09-26T20:31:19Z) - Provably Efficient Representation Learning with Tractable Planning in Low-Rank POMDP [81.00800920928621]
We study representation learning in partially observable Markov decision processes (POMDPs).
We first present an algorithm for decodable POMDPs that combines maximum likelihood estimation (MLE) and optimism in the face of uncertainty (OFU).
We then show how to adapt this algorithm to also work in the broader class of $\gamma$-observable POMDPs.
arXiv Detail & Related papers (2023-06-21T16:04:03Z) - Reward Collapse in Aligning Large Language Models [64.98482888193267]
We study the phenomenon of 'reward collapse', an empirical observation where the prevailing ranking-based approach results in an identical reward distribution.
Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
arXiv Detail & Related papers (2023-05-28T02:12:00Z) - ReAct: Temporal Action Detection with Relational Queries [84.76646044604055]
This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries.
We first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations.
Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries.
arXiv Detail & Related papers (2022-07-14T17:46:37Z) - Delayed Rewards Calibration via Reward Empirical Sufficiency [11.089718301262433]
We introduce a delayed reward calibration paradigm inspired by a classification perspective.
We define an empirical sufficient distribution, where the state vectors within the distribution will lead agents to reward signals.
A purify-trained classifier is designed to obtain the distribution and generate the calibrated rewards.
arXiv Detail & Related papers (2021-02-21T06:42:31Z) - Logically Consistent Loss for Visual Question Answering [66.83963844316561]
The current advancement in neural-network-based Visual Question Answering (VQA) cannot ensure logical consistency due to the independent and identically distributed (i.i.d.) assumption.
We propose a new model-agnostic logic constraint to tackle this issue by formulating a logically consistent loss in the multi-task learning framework.
Experiments confirm that the proposed loss formulae and the introduction of hybrid batches lead to more consistency as well as better performance.
arXiv Detail & Related papers (2020-11-19T20:31:05Z) - Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning [22.242379207077217]
We show how exposing the reward function's code to the RL agent lets it exploit the function's internal structure to learn optimal policies.
First, we propose reward machines, a type of finite state machine that supports the specification of reward functions.
We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning.
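Concretely, a reward machine can be represented as a Mealy-style finite state machine whose transitions fire on the set of propositions made true at each environment step and emit a reward. Below is a minimal sketch; the class layout and the coffee-then-office example are illustrative, not taken from the paper.

```python
# A minimal reward machine sketch: a finite state machine whose transitions are
# triggered by truth assignments over propositions and emit rewards.
class RewardMachine:
    def __init__(self, states, initial, delta_u, delta_r):
        self.states = states
        self.u = initial
        self.delta_u = delta_u   # (u, frozenset of true props) -> next state
        self.delta_r = delta_r   # (u, frozenset of true props) -> reward

    def step(self, true_props):
        """Advance on the propositions made true by the environment step."""
        key = (self.u, frozenset(true_props))
        reward = self.delta_r.get(key, 0.0)
        self.u = self.delta_u.get(key, self.u)  # stay put on unspecified labels
        return reward

# Example (illustrative): reward 1 for reaching the office *after* getting coffee,
# a non-Markovian task that depends on the history of events, not just the state.
rm = RewardMachine(
    states={"u0", "u1"},
    initial="u0",
    delta_u={("u0", frozenset({"coffee"})): "u1",
             ("u1", frozenset({"office"})): "u0"},
    delta_r={("u1", frozenset({"office"})): 1.0},
)
print(rm.step({"coffee"}), rm.step({"office"}))  # 0.0 1.0
```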
arXiv Detail & Related papers (2020-10-06T00:10:16Z) - Online Learning of Non-Markovian Reward Models [2.064612766965483]
We consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves, together with a non-Markovian reward function.
While the MDP is known by the agent, the reward function is unknown to the agent and must be learned.
We use Angluin's L* active learning algorithm to learn a Mealy machine representing the underlying non-Markovian reward machine.
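For context, Angluin's L* learns from a teacher that answers membership queries (the output word produced by a given input word) and equivalence queries (a counterexample to a hypothesis machine, if one exists); the preference queries of REMAP above replace exactly the membership half of this interface. The sketch below shows that teacher interface for a Mealy-machine reward model; the `MealyTeacher` class, the brute-force equivalence check, and the two-state target are illustrative assumptions.

```python
# A hedged sketch of the teacher interface an L*-style learner needs when
# learning a Mealy machine reward model.
from itertools import product

class MealyTeacher:
    def __init__(self, transition, output, initial):
        self.transition, self.output, self.initial = transition, output, initial

    def membership(self, word):
        """Output (reward) sequence produced by running `word` through the target."""
        state, outputs = self.initial, []
        for symbol in word:
            outputs.append(self.output[(state, symbol)])
            state = self.transition[(state, symbol)]
        return tuple(outputs)

    def equivalence(self, hypothesis, alphabet, max_len=4):
        """Brute-force check on short words; returns a counterexample or None."""
        for n in range(1, max_len + 1):
            for word in product(alphabet, repeat=n):
                if hypothesis(word) != self.membership(word):
                    return word
        return None

# Illustrative target: reward 1 only on a "b" taken after an "a".
teacher = MealyTeacher(
    transition={("q0", "a"): "q1", ("q0", "b"): "q0",
                ("q1", "a"): "q1", ("q1", "b"): "q0"},
    output={("q0", "a"): 0, ("q0", "b"): 0,
            ("q1", "a"): 0, ("q1", "b"): 1},
    initial="q0",
)
always_zero = lambda word: tuple(0 for _ in word)    # naive first hypothesis
print(teacher.membership(("a", "b")))                # (0, 1)
print(teacher.equivalence(always_zero, ("a", "b")))  # counterexample: ('a', 'b')
```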
arXiv Detail & Related papers (2020-09-26T13:54:34Z) - Learning Non-Markovian Reward Models in MDPs [0.0]
We show how to formalise the non-Markovian reward function using a Mealy machine.
In our formal setting, we consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves.
While the MDP is known by the agent, the reward function is unknown to the agent and must be learnt.
arXiv Detail & Related papers (2020-01-25T10:51:42Z)