Reward Shaping for Reinforcement Learning with Omega-Regular Objectives
- URL: http://arxiv.org/abs/2001.05977v1
- Date: Thu, 16 Jan 2020 18:22:50 GMT
- Title: Reward Shaping for Reinforcement Learning with Omega-Regular Objectives
- Authors: E. M. Hahn, M. Perez, S. Schewe, F. Somenzi, A. Trivedi, D. Wojtczak
- Abstract summary: We exploit good-for-MDPs automata for model free reinforcement learning.
The drawback of this translation is that the rewards are, on average, reaped very late.
We devise a new reward shaping approach that overcomes this issue.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, successful approaches have been made to exploit good-for-MDPs
automata (Büchi automata with a restricted form of nondeterminism) for
model-free reinforcement learning; this class of automata subsumes good-for-games
automata and the most widespread class of limit-deterministic automata. The
foundation of using these Büchi automata is that the Büchi condition can,
for good-for-MDPs automata, be translated to reachability.
The drawback of this translation is that the rewards are, on average, reaped
very late, which requires long episodes during the learning process. We devise
a new reward shaping approach that overcomes this issue. We show that the
resulting model is equivalent to a discounted payoff objective with a biased
discount that simplifies and improves on prior work in this direction.
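The biased-discount idea can be illustrated on a toy example. The following is a minimal sketch, not the paper's exact construction: accepting transitions of the product MDP pay reward (1 - gamma_b) and are discounted by gamma_b, while all other transitions pay 0 and are discounted by gamma, with gamma_b > gamma. The 3-state cycle, the action set, and all constants here are illustrative assumptions.

```python
GAMMA, GAMMA_B = 0.90, 0.99  # ordinary and biased discount factors
N_STATES, N_ACTIONS = 3, 2   # toy product MDP: a 3-state cycle

def step(s, a):
    """Action 0 stays put, action 1 advances around the cycle; every
    transition entering state 2 counts as an accepting (Buchi) transition."""
    s2 = s if a == 0 else (s + 1) % N_STATES
    if s2 == 2:
        return s2, 1.0 - GAMMA_B, GAMMA_B  # accepting edge: shaped reward
    return s2, 0.0, GAMMA                  # ordinary edge: no reward

# Q-value iteration under the transition-dependent ("biased") discount.
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
for _ in range(2000):
    new_q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            s2, r, g = step(s, a)
            new_q[s][a] = r + g * max(Q[s2])
    Q = new_q

policy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)]
print(policy)  # -> [1, 1, 0]: advance to the accepting edge, then loop on it
```

Because gamma_b > gamma, value flows back from accepting edges faster than it decays along ordinary edges, so the optimal policy reaches an accepting transition and keeps taking it, which is the intended reachability surrogate for the Büchi condition.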
Related papers
- Walking the Values in Bayesian Inverse Reinforcement Learning [66.68997022043075]
A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood.
We propose ValueWalk - a new Markov chain Monte Carlo method based on this insight.
arXiv Detail & Related papers (2024-07-15T17:59:52Z)
- Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
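The redistribution step described above can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation: a single scalar reward for a completion is spread over its tokens in proportion to (hypothetical) attention weights from the reward model, so that the dense per-token rewards sum back to the original scalar.

```python
def redistribute_reward(scalar_reward, attention_weights):
    """Spread one sequence-level reward over tokens, proportionally to the
    attention weight each token received (weights need not be normalized)."""
    total = sum(attention_weights)
    return [scalar_reward * w / total for w in attention_weights]

# Hypothetical example: a completion of three tokens, reward 2.0.
dense = redistribute_reward(2.0, [0.1, 0.3, 0.6])
print(dense)  # per-token rewards; they sum to the original 2.0
```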
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Language Model Alignment with Elastic Reset [8.503863369800191]
We argue that commonly used test metrics are insufficient to measure how different algorithms trade off reward against drift.
We propose Elastic Reset, a new algorithm that achieves higher reward with less drift without explicitly modifying the training objective.
We demonstrate that fine-tuning language models with Elastic Reset leads to state-of-the-art performance on a small scale pivot-translation benchmark.
arXiv Detail & Related papers (2023-12-06T22:53:34Z)
- CAME: Contrastive Automated Model Evaluation [12.879345202312628]
Contrastive Automated Model Evaluation (CAME) is a novel AutoEval framework that removes the need to involve the training set in the evaluation loop.
CAME establishes a new SOTA result for AutoEval by surpassing prior work significantly.
arXiv Detail & Related papers (2023-08-22T01:24:14Z)
- Alternating Good-for-MDP Automata [4.429642479975602]
We show that it is possible to repair bad-for-MDPs automata by using good-for-MDPs (GFM) Büchi automata.
A translation to nondeterministic Rabin or Büchi automata comes at an exponential cost, even without requiring the target automaton to be good-for-MDPs.
The surprising answer is that we have to pay significantly less when we instead expand the good-for-MDPs property to alternating automata.
arXiv Detail & Related papers (2022-05-06T14:01:47Z)
- Model-Free Reinforcement Learning for Symbolic Automata-encoded Objectives [0.0]
Reinforcement learning (RL) is a popular approach for robotic path planning in uncertain environments.
Poorly designed rewards can lead to policies that obtain maximal rewards but fail to satisfy the desired task objectives or are unsafe.
We propose using formal specifications in the form of symbolic automata.
arXiv Detail & Related papers (2022-02-04T21:54:36Z)
- Model-Augmented Q-learning [112.86795579978802]
We propose a MFRL framework that is augmented with the components of model-based RL.
Specifically, we propose to estimate not only the $Q$-values but also both the transition and the reward with a shared network.
We show that the proposed scheme, called Model-augmented $Q$-learning (MQL), obtains a policy-invariant solution identical to the one obtained by learning with the true reward.
arXiv Detail & Related papers (2021-02-07T17:56:50Z)
- Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z)
- Induction and Exploitation of Subgoal Automata for Reinforcement Learning [75.55324974788475]
We present ISA, an approach for learning and exploiting subgoals in episodic reinforcement learning (RL) tasks.
ISA interleaves reinforcement learning with the induction of a subgoal automaton, an automaton whose edges are labeled by the task's subgoals.
A subgoal automaton also consists of two special states: a state indicating the successful completion of the task, and a state indicating that the task has finished without succeeding.
arXiv Detail & Related papers (2020-09-08T16:42:55Z)
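The subgoal automaton described in the ISA entry above can be sketched as a small data structure. All names, states, and subgoal labels here are hypothetical illustrations, not ISA's actual implementation: edges are labeled by subgoals, and two special states mark completion and failure.

```python
class SubgoalAutomaton:
    """Edges are labeled by subgoal propositions; two special states mark
    successful completion ("accept") and failure ("reject")."""
    ACCEPT, REJECT = "accept", "reject"

    def __init__(self, initial):
        self.initial = initial
        self.edges = {}  # (state, subgoal label) -> successor state

    def add_edge(self, src, subgoal, dst):
        self.edges[(src, subgoal)] = dst

    def step(self, state, observed_subgoals):
        """Follow the first matching subgoal edge; stay put otherwise."""
        for g in observed_subgoals:
            if (state, g) in self.edges:
                return self.edges[(state, g)]
        return state

# Hypothetical task: "get the key, then open the door"; touching lava fails.
aut = SubgoalAutomaton("u0")
aut.add_edge("u0", "key", "u1")
aut.add_edge("u1", "door", SubgoalAutomaton.ACCEPT)
aut.add_edge("u1", "lava", SubgoalAutomaton.REJECT)

state = aut.initial
for obs in [["key"], [], ["door"]]:  # subgoals observed at each RL step
    state = aut.step(state, obs)
print(state)  # -> "accept"
```

An RL agent can then condition its policy on the automaton state, which records which subgoals have been achieved so far.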
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.