Reward Shaping for Reinforcement Learning with Omega-Regular Objectives
- URL: http://arxiv.org/abs/2001.05977v1
- Date: Thu, 16 Jan 2020 18:22:50 GMT
- Title: Reward Shaping for Reinforcement Learning with Omega-Regular Objectives
- Authors: E. M. Hahn, M. Perez, S. Schewe, F. Somenzi, A. Trivedi, D. Wojtczak
- Abstract summary: We exploit good-for-MDPs automata for model-free reinforcement learning.
The drawback of this translation is that the rewards are, on average, reaped very late.
We devise a new reward shaping approach that overcomes this issue.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, successful approaches have been made to exploit good-for-MDPs
automata (Büchi automata with a restricted form of nondeterminism) for
model-free reinforcement learning, a class of automata that subsumes
good-for-games automata and the most widespread class of limit-deterministic
automata. The foundation of using these Büchi automata is that the Büchi
condition can, for good-for-MDP automata, be translated to reachability.
The drawback of this translation is that the rewards are, on average, reaped
very late, which requires long episodes during the learning process. We devise
a new reward shaping approach that overcomes this issue. We show that the
resulting model is equivalent to a discounted payoff objective with a biased
discount that simplifies and improves on prior work in this direction.
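The reshaped objective can be illustrated with a small, hypothetical tabular Q-learning sketch (not the authors' implementation): on the product MDP, transitions along accepting edges of the GFM Büchi automaton pay reward 1 - γ and are discounted by γ, while all other transitions pay nothing and are undiscounted, so the value of a run that traverses accepting edges forever tends to 1 and reward is reaped as soon as an accepting edge is taken rather than only at episode end.

```python
import random

def q_learning_biased_discount(transitions, accepting, n_states, n_actions,
                               gamma_b=0.99, alpha=0.1, episodes=2000,
                               horizon=50, seed=0):
    """Tabular Q-learning with a biased (state/action-dependent) discount.

    transitions[s][a] -> next state (deterministic, for simplicity);
    accepting: set of (state, action) pairs that traverse an accepting edge.
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            # epsilon-greedy action selection
            if rng.random() < 0.2:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2 = transitions[s][a]
            if (s, a) in accepting:
                r, disc = 1.0 - gamma_b, gamma_b   # pay reward, then discount
            else:
                r, disc = 0.0, 1.0                 # no reward, no discounting
            Q[s][a] += alpha * (r + disc * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

On a toy MDP where one action loops through an accepting edge and the other falls into a rejecting sink, the accepting action's Q-value approaches 1 while the sink's stays at 0, which is exactly the reachability-style separation the translation relies on.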
Related papers
- Learning Quantitative Automata Modulo Theories [17.33092604696224]
We present QUINTIC, an active learning algorithm, wherein the learner infers a valid automaton through deductive reasoning.
Our evaluations use the theory of rationals to learn summation, discounted-summation, product, and classification quantitative automata.
arXiv Detail & Related papers (2024-11-15T21:51:14Z) - Aligning Large Language Models via Self-Steering Optimization [78.42826116686435]
We introduce Self-Steering Optimization ($SSO$), an algorithm that autonomously generates high-quality preference signals.
$SSO$ maintains the accuracy of signals by ensuring a consistent gap between chosen and rejected responses.
We validate the effectiveness of $SSO$ with two foundation models, Qwen2 and Llama3.1, indicating that it provides accurate, on-policy preference signals.
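The "consistent gap" idea can be sketched as a simple margin (hinge) objective over the scores of chosen and rejected responses; this is an illustrative assumption about the shape of such a constraint, not SSO's actual loss:

```python
def gap_preference_loss(chosen_score: float, rejected_score: float,
                        margin: float = 1.0) -> float:
    """Hinge loss that is zero once the chosen response outscores the
    rejected one by at least `margin`, enforcing a consistent gap."""
    return max(0.0, margin - (chosen_score - rejected_score))
```

A preference pair whose gap already exceeds the margin contributes no gradient, so training pressure concentrates on pairs where the signal is still ambiguous.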
arXiv Detail & Related papers (2024-10-22T16:04:03Z) - ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [50.45155830888697]
We develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step value to train policy and reward models.
We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same search budget.
arXiv Detail & Related papers (2024-06-06T07:40:00Z) - Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
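The redistribution step can be sketched as follows, assuming non-negative per-token attention weights have already been extracted from the reward model (the names and the simple proportional normalization are illustrative, not the paper's exact scheme):

```python
def redistribute_reward(total_reward, attention_weights):
    """Spread a single scalar reward over a completion in proportion to
    (non-negative) per-token attention weights, so each token receives a
    dense share instead of the reward arriving only at the final token."""
    total = sum(attention_weights)
    if total == 0:                       # degenerate case: spread uniformly
        n = len(attention_weights)
        return [total_reward / n] * n
    return [total_reward * w / total for w in attention_weights]
```

The shares sum back to the original scalar reward, so the return of the full completion is unchanged while each step now carries a learning signal.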
arXiv Detail & Related papers (2024-02-01T17:10:35Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.
Recent methods aim to mitigate misalignment by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Language Model Alignment with Elastic Reset [8.503863369800191]
We argue that commonly used test metrics are insufficient to measure how different algorithms trade off reward against drift.
We propose Elastic Reset, a new algorithm that achieves higher reward with less drift without explicitly modifying the training objective.
We demonstrate that fine-tuning language models with Elastic Reset leads to state-of-the-art performance on a small scale pivot-translation benchmark.
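One published description of Elastic Reset tracks an exponential moving average (EMA) of the policy weights and periodically resets the policy to that EMA and the EMA to the initial model; the sketch below assumes that recipe and uses plain float lists in place of real model parameters:

```python
def ema_update(ema, weights, decay=0.99):
    """Exponential moving average of model weights (plain float lists)."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

def elastic_reset_step(policy, ema, init, step, reset_every=100, decay=0.99):
    """One bookkeeping step of an Elastic-Reset-style training loop (a
    sketch of the published recipe, not the authors' code): keep an EMA of
    the drifting policy; every `reset_every` steps, reset the policy to the
    EMA and re-anchor the EMA at the initial (pretrained) weights."""
    ema = ema_update(ema, policy, decay)
    if step % reset_every == 0:
        policy = list(ema)     # hard reset of the drifting policy
        ema = list(init)       # re-anchor the EMA at the initial model
    return policy, ema
```

Because the reset pulls the policy back toward weights it averaged through, reward gained during training is largely preserved while drift from the initial model is repeatedly damped.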
arXiv Detail & Related papers (2023-12-06T22:53:34Z) - Alternating Good-for-MDP Automata [4.429642479975602]
We show that it is possible to repair automata that are not good for MDPs by using good-for-MDPs (GFM) Büchi automata.
A translation to nondeterministic Rabin or Büchi automata comes at an exponential cost, even without requiring the target automaton to be good-for-MDPs.
The surprising answer is that we have to pay significantly less when we instead expand the good-for-MDP property to alternating automata.
arXiv Detail & Related papers (2022-05-06T14:01:47Z) - Model-Free Reinforcement Learning for Symbolic Automata-encoded Objectives [0.0]
Reinforcement learning (RL) is a popular approach for robotic path planning in uncertain environments.
Poorly designed rewards can lead to policies that achieve maximal reward yet fail to satisfy the desired task objectives or are unsafe.
We propose using formal specifications in the form of symbolic automata.
arXiv Detail & Related papers (2022-02-04T21:54:36Z) - Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z) - Induction and Exploitation of Subgoal Automata for Reinforcement Learning [75.55324974788475]
We present ISA, an approach for learning and exploiting subgoals in episodic reinforcement learning (RL) tasks.
ISA interleaves reinforcement learning with the induction of a subgoal automaton, an automaton whose edges are labeled by the task's subgoals.
A subgoal automaton also consists of two special states: a state indicating the successful completion of the task, and a state indicating that the task has finished without succeeding.
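The structure described above can be sketched as a small class (hypothetical names, not the ISA implementation): edges are labeled by subgoals, and two special states mark successful completion and failure.

```python
class SubgoalAutomaton:
    """Deterministic automaton whose edges are labeled by a task's
    subgoals, with one accepting state (task completed successfully) and
    one rejecting state (task finished without succeeding)."""

    def __init__(self, initial, accept, reject):
        self.state = initial
        self.accept, self.reject = accept, reject
        self.edges = {}                       # (state, subgoal) -> state

    def add_edge(self, src, subgoal, dst):
        self.edges[(src, subgoal)] = dst

    def step(self, subgoal):
        # stay in the current state if no outgoing edge matches the subgoal
        self.state = self.edges.get((self.state, subgoal), self.state)
        return self.state

    def finished(self):
        return self.state in (self.accept, self.reject)

    def succeeded(self):
        return self.state == self.accept
```

An RL agent would advance this automaton alongside the environment, observing which subgoals fire at each step and using the automaton state to shape rewards or decompose the task.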
arXiv Detail & Related papers (2020-09-08T16:42:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.