Reward Shaping for Reinforcement Learning with Omega-Regular Objectives
- URL: http://arxiv.org/abs/2001.05977v1
- Date: Thu, 16 Jan 2020 18:22:50 GMT
- Title: Reward Shaping for Reinforcement Learning with Omega-Regular Objectives
- Authors: E. M. Hahn, M. Perez, S. Schewe, F. Somenzi, A. Trivedi, D. Wojtczak
- Abstract summary: We exploit good-for-MDPs automata for model-free reinforcement learning.
The drawback of this translation is that the rewards are, on average, reaped very late.
We devise a new reward shaping approach that overcomes this issue.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, successful approaches have been made to exploit good-for-MDPs
automata (Büchi automata with a restricted form of nondeterminism) for
model-free reinforcement learning, a class of automata that subsumes
good-for-games automata and the most widespread class of limit-deterministic
automata. The foundation of using these Büchi automata is that the Büchi
condition can, for good-for-MDP automata, be translated to reachability.
The drawback of this translation is that the rewards are, on average, reaped
very late, which requires long episodes during the learning process. We devise
a new reward shaping approach that overcomes this issue. We show that the
resulting model is equivalent to a discounted payoff objective with a biased
discount that simplifies and improves on prior work in this direction.
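The reshaped objective can be illustrated with a small, hypothetical tabular Q-learning sketch (not the authors' implementation): on the product MDP, transitions along accepting edges of the GFM Büchi automaton pay reward 1 - γ and are discounted by γ, while all other transitions pay nothing and are undiscounted, so the value of a run that traverses accepting edges forever tends to 1 and reward is reaped as soon as an accepting edge is taken rather than only at episode end.

```python
import random

def q_learning_biased_discount(transitions, accepting, n_states, n_actions,
                               gamma_b=0.99, alpha=0.1, episodes=2000,
                               horizon=50, seed=0):
    """Tabular Q-learning with a biased (state/action-dependent) discount.

    transitions[s][a] -> next state (deterministic, for simplicity);
    accepting: set of (state, action) pairs that traverse an accepting edge.
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            # epsilon-greedy action selection
            if rng.random() < 0.2:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2 = transitions[s][a]
            if (s, a) in accepting:
                r, disc = 1.0 - gamma_b, gamma_b   # pay reward, then discount
            else:
                r, disc = 0.0, 1.0                 # no reward, no discounting
            Q[s][a] += alpha * (r + disc * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

On a toy MDP where one action loops through an accepting edge and the other falls into a rejecting sink, the accepting action's Q-value approaches 1 while the sink's stays at 0, which is exactly the reachability-style separation the translation relies on.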
Related papers
- Learning Quantitative Automata Modulo Theories [17.33092604696224]
We present QUINTIC, an active learning algorithm, wherein the learner infers a valid automaton through deductive reasoning.
Our evaluations use the theory of rationals to learn summation, discounted-summation, product, and classification quantitative automata.
arXiv Detail & Related papers (2024-11-15T21:51:14Z) - Aligning Large Language Models via Self-Steering Optimization [78.42826116686435]
We introduce Self-Steering Optimization ($SSO$), an algorithm that autonomously generates high-quality preference signals.
$SSO$ maintains the accuracy of signals by ensuring a consistent gap between chosen and rejected responses.
We validate the effectiveness of $SSO$ with two foundation models, Qwen2 and Llama3.1, indicating that it provides accurate, on-policy preference signals.
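The "consistent gap" idea can be sketched as a simple margin (hinge) objective over the scores of chosen and rejected responses; this is an illustrative assumption about the shape of such a constraint, not SSO's actual loss:

```python
def gap_preference_loss(chosen_score: float, rejected_score: float,
                        margin: float = 1.0) -> float:
    """Hinge loss that is zero once the chosen response outscores the
    rejected one by at least `margin`, enforcing a consistent gap."""
    return max(0.0, margin - (chosen_score - rejected_score))
```

A preference pair whose gap already exceeds the margin contributes no gradient, so training pressure concentrates on pairs where the signal is still ambiguous.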
arXiv Detail & Related papers (2024-10-22T16:04:03Z) - ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [50.45155830888697]
We develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step value to train policy and reward models.
We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same search budget.
arXiv Detail & Related papers (2024-06-06T07:40:00Z) - Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
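The redistribution step can be sketched as follows, assuming non-negative per-token attention weights have already been extracted from the reward model (the names and the simple proportional normalization are illustrative, not the paper's exact scheme):

```python
def redistribute_reward(total_reward, attention_weights):
    """Spread a single scalar reward over a completion in proportion to
    (non-negative) per-token attention weights, so each token receives a
    dense share instead of the reward arriving only at the final token."""
    total = sum(attention_weights)
    if total == 0:                       # degenerate case: spread uniformly
        n = len(attention_weights)
        return [total_reward / n] * n
    return [total_reward * w / total for w in attention_weights]
```

The shares sum back to the original scalar reward, so the return of the full completion is unchanged while each step now carries a learning signal.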
arXiv Detail & Related papers (2024-02-01T17:10:35Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.
Recent methods aim to mitigate misalignment by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Language Model Alignment with Elastic Reset [8.503863369800191]
We argue that commonly used test metrics are insufficient to measure how different algorithms trade off reward against drift.
We propose Elastic Reset, a new algorithm that achieves higher reward with less drift without explicitly modifying the training objective.
We demonstrate that fine-tuning language models with Elastic Reset leads to state-of-the-art performance on a small scale pivot-translation benchmark.
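One published description of Elastic Reset tracks an exponential moving average (EMA) of the policy weights and periodically resets the policy to that EMA and the EMA to the initial model; the sketch below assumes that recipe and uses plain float lists in place of real model parameters:

```python
def ema_update(ema, weights, decay=0.99):
    """Exponential moving average of model weights (plain float lists)."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

def elastic_reset_step(policy, ema, init, step, reset_every=100, decay=0.99):
    """One bookkeeping step of an Elastic-Reset-style training loop (a
    sketch of the published recipe, not the authors' code): keep an EMA of
    the drifting policy; every `reset_every` steps, reset the policy to the
    EMA and re-anchor the EMA at the initial (pretrained) weights."""
    ema = ema_update(ema, policy, decay)
    if step % reset_every == 0:
        policy = list(ema)     # hard reset of the drifting policy
        ema = list(init)       # re-anchor the EMA at the initial model
    return policy, ema
```

Because the reset pulls the policy back toward weights it averaged through, reward gained during training is largely preserved while drift from the initial model is repeatedly damped.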
arXiv Detail & Related papers (2023-12-06T22:53:34Z) - Alternating Good-for-MDP Automata [4.429642479975602]
We show that it is possible to repair automata that are not good for MDPs by using good-for-MDPs (GFM) Büchi automata.
A translation to nondeterministic Rabin or Büchi automata comes at an exponential cost, even without requiring the target automaton to be good-for-MDPs.
The surprising answer is that we have to pay significantly less when we instead expand the good-for-MDP property to alternating automata.
arXiv Detail & Related papers (2022-05-06T14:01:47Z) - Model-Free Reinforcement Learning for Symbolic Automata-encoded Objectives [0.0]
Reinforcement learning (RL) is a popular approach for robotic path planning in uncertain environments.
Poorly designed rewards can lead to policies that achieve maximal reward yet fail to satisfy the desired task objectives or are unsafe.
We propose using formal specifications in the form of symbolic automata.
arXiv Detail & Related papers (2022-02-04T21:54:36Z) - Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z) - Induction and Exploitation of Subgoal Automata for Reinforcement Learning [75.55324974788475]
We present ISA, an approach for learning and exploiting subgoals in episodic reinforcement learning (RL) tasks.
ISA interleaves reinforcement learning with the induction of a subgoal automaton, an automaton whose edges are labeled by the task's subgoals.
A subgoal automaton also consists of two special states: a state indicating the successful completion of the task, and a state indicating that the task has finished without succeeding.
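The structure described above can be sketched as a small class (hypothetical names, not the ISA implementation): edges are labeled by subgoals, and two special states mark successful completion and failure.

```python
class SubgoalAutomaton:
    """Deterministic automaton whose edges are labeled by a task's
    subgoals, with one accepting state (task completed successfully) and
    one rejecting state (task finished without succeeding)."""

    def __init__(self, initial, accept, reject):
        self.state = initial
        self.accept, self.reject = accept, reject
        self.edges = {}                       # (state, subgoal) -> state

    def add_edge(self, src, subgoal, dst):
        self.edges[(src, subgoal)] = dst

    def step(self, subgoal):
        # stay in the current state if no outgoing edge matches the subgoal
        self.state = self.edges.get((self.state, subgoal), self.state)
        return self.state

    def finished(self):
        return self.state in (self.accept, self.reject)

    def succeeded(self):
        return self.state == self.accept
```

An RL agent would advance this automaton alongside the environment, observing which subgoals fire at each step and using the automaton state to shape rewards or decompose the task.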
arXiv Detail & Related papers (2020-09-08T16:42:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.