Reward Shaping for Reinforcement Learning with Omega-Regular Objectives
- URL: http://arxiv.org/abs/2001.05977v1
- Date: Thu, 16 Jan 2020 18:22:50 GMT
- Title: Reward Shaping for Reinforcement Learning with Omega-Regular Objectives
- Authors: E. M. Hahn, M. Perez, S. Schewe, F. Somenzi, A. Trivedi, D. Wojtczak
- Abstract summary: We exploit good-for-MDPs automata for model-free reinforcement learning.
The drawback of this translation is that the rewards are, on average, reaped very late.
We devise a new reward shaping approach that overcomes this issue.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, successful approaches have exploited good-for-MDPs
automata (Büchi automata with a restricted form of nondeterminism) for
model-free reinforcement learning; this class of automata subsumes
good-for-games automata and the most widespread class of limit-deterministic
automata. The foundation for using these Büchi automata is that, for
good-for-MDP automata, the Büchi condition can be translated to reachability.
The drawback of this translation is that the rewards are, on average, reaped
very late, which requires long episodes during the learning process. We devise
a new reward shaping approach that overcomes this issue. We show that the
resulting model is equivalent to a discounted payoff objective with a biased
discount that simplifies and improves on prior work in this direction.
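To make the translation and the shaped reward concrete, the following is a minimal sketch of tabular Q-learning on the product of an MDP with a GFM Büchi automaton, where accepting edges pay a small reward and carry a biased discount in the spirit of the abstract. The ProductMDP interface (reset, actions, step with an accepting flag) and all constants are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: tabular Q-learning on a product MDP whose accepting edges
# pay a small reward and are discounted more strongly (a "biased" discount),
# standing in for the reachability translation of the Buchi condition
# described in the abstract. The ProductMDP interface and all constants are
# hypothetical placeholders, not the paper's implementation.
import random
from collections import defaultdict

GAMMA = 0.999        # ordinary discount on non-accepting transitions
GAMMA_ACC = 0.99     # biased discount applied on accepting edges
EPSILON = 0.1        # exploration rate
ALPHA = 0.1          # learning rate

def q_learning(env, episodes=10_000, horizon=500):
    """env is assumed to expose reset() -> state, actions(state) -> list,
    and step(state, action) -> (next_state, accepting: bool, done: bool)."""
    Q = defaultdict(float)

    def best(s):
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        for _ in range(horizon):
            a = random.choice(env.actions(s)) if random.random() < EPSILON else best(s)
            s2, accepting, done = env.step(s, a)
            # Reward and discount are tied to acceptance: an accepting edge
            # pays (1 - GAMMA_ACC) and is discounted by GAMMA_ACC. As both
            # discounts approach 1, this is how this line of work makes the
            # discounted return track the probability of satisfying the
            # Buchi condition.
            r = (1.0 - GAMMA_ACC) if accepting else 0.0
            disc = GAMMA_ACC if accepting else GAMMA
            target = r + (0.0 if done else disc * Q[(s2, best(s2))])
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            if done:
                break
            s = s2
    return Q
```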
Related papers
- Aligning Large Language Models via Self-Steering Optimization [78.42826116686435]
We introduce Self-Steering Optimization ($SSO$), an algorithm that autonomously generates high-quality preference signals.
$SSO$ maintains the accuracy of signals by ensuring a consistent gap between chosen and rejected responses.
We validate the effectiveness of $SSO$ with two foundation models, Qwen2 and Llama3.1, indicating that it provides accurate, on-policy preference signals.
arXiv Detail & Related papers (2024-10-22T16:04:03Z) - ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [50.45155830888697]
ReST-MCTS* integrates process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces.
We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines.
We then show that, by using traces searched by this tree-search policy as training data, we can continually improve three language models over multiple iterations.
arXiv Detail & Related papers (2024-06-06T07:40:00Z) - Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
arXiv Detail & Related papers (2024-02-01T17:10:35Z) - REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Language Model Alignment with Elastic Reset [8.503863369800191]
We argue that commonly-used test metrics are insufficient to measure how different algorithms trade off reward and drift.
We propose Elastic Reset, a new algorithm that achieves higher reward with less drift without explicitly modifying the training objective.
We demonstrate that fine-tuning language models with Elastic Reset leads to state-of-the-art performance on a small-scale pivot-translation benchmark.
arXiv Detail & Related papers (2023-12-06T22:53:34Z) - CAME: Contrastive Automated Model Evaluation [12.879345202312628]
Contrastive Automated Model Evaluation (CAME) is a novel AutoEval framework that removes the training set from the evaluation loop.
CAME establishes new state-of-the-art results for AutoEval, significantly surpassing prior work.
arXiv Detail & Related papers (2023-08-22T01:24:14Z) - Alternating Good-for-MDP Automata [4.429642479975602]
We show that it is possible to repair automata that are not good-for-MDPs by turning them into good-for-MDPs (GFM) Büchi automata.
A translation to nondeterministic Rabin or Büchi automata comes at an exponential cost, even without requiring the target automaton to be good-for-MDPs.
The question is whether this blow-up is unavoidable; the surprising answer is that we have to pay significantly less when we instead expand the good-for-MDP property to alternating automata.
arXiv Detail & Related papers (2022-05-06T14:01:47Z) - Model-Free Reinforcement Learning for Symbolic Automata-encoded Objectives [0.0]
Reinforcement learning (RL) is a popular approach for robotic path planning in uncertain environments.
Poorly designed rewards can lead to policies that achieve maximal reward yet fail to satisfy the desired task objectives or are unsafe.
We propose using formal specifications in the form of symbolic automata.
arXiv Detail & Related papers (2022-02-04T21:54:36Z) - Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z) - Induction and Exploitation of Subgoal Automata for Reinforcement Learning [75.55324974788475]
We present ISA, an approach for learning and exploiting subgoals in episodic reinforcement learning (RL) tasks.
ISA interleaves reinforcement learning with the induction of a subgoal automaton, an automaton whose edges are labeled by the task's subgoals.
A subgoal automaton also contains two special states: one indicating successful completion of the task and one indicating that the task has finished without succeeding (a minimal sketch of such an automaton appears after this list).
arXiv Detail & Related papers (2020-09-08T16:42:55Z)
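As a companion to the ISA entry above, here is a minimal, hypothetical sketch of a subgoal automaton: states connected by edges labelled with subgoals, plus the two special states mentioned in the summary. The class, the key/door/lava labels, and the step semantics are illustrative assumptions, not ISA's actual interface.

```python
# Hedged sketch of a subgoal automaton in the spirit of ISA: states connected
# by edges labelled with subgoals, plus dedicated accepting and rejecting
# states. Names and labels are illustrative placeholders, not ISA's API.
class SubgoalAutomaton:
    def __init__(self, initial, accepting, rejecting):
        self.initial = initial
        self.accepting = accepting   # task completed successfully
        self.rejecting = rejecting   # task finished without succeeding
        self.edges = {}              # (state, subgoal_label) -> next state

    def add_edge(self, state, subgoal, next_state):
        self.edges[(state, subgoal)] = next_state

    def step(self, state, observed_subgoals):
        """Advance on the first matching observed subgoal; stay otherwise."""
        for g in observed_subgoals:
            if (state, g) in self.edges:
                return self.edges[(state, g)]
        return state

# Example: reach the key, then the door; touching lava fails the task.
aut = SubgoalAutomaton(initial="u0", accepting="u_acc", rejecting="u_rej")
aut.add_edge("u0", "key", "u1")
aut.add_edge("u1", "door", "u_acc")
aut.add_edge("u0", "lava", "u_rej")
aut.add_edge("u1", "lava", "u_rej")
```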
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.