A Reminder of its Brittleness: Language Reward Shaping May Hinder
Learning for Instruction Following Agents
- URL: http://arxiv.org/abs/2305.16621v2
- Date: Thu, 17 Aug 2023 06:11:14 GMT
- Title: A Reminder of its Brittleness: Language Reward Shaping May Hinder
Learning for Instruction Following Agents
- Authors: Sukai Huang, Nir Lipovetzky and Trevor Cohn
- Abstract summary: We argue that the apparent success of LRS is brittle, and prior positive findings can be attributed to weak RL baselines.
We provided theoretical and empirical evidence that agents trained using LRS rewards converge more slowly compared to pure RL agents.
- Score: 38.928166383780535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Teaching agents to follow complex written instructions has been an important
yet elusive goal. One technique for enhancing learning efficiency is language
reward shaping (LRS). Within a reinforcement learning (RL) framework, LRS
involves training a reward function that rewards behaviours precisely aligned
with given language instructions. We argue that the apparent success of LRS is
brittle, and prior positive findings can be attributed to weak RL baselines.
Specifically, we identified suboptimal LRS designs that reward partially
matched trajectories, and we characterised a novel reward perturbation to
capture this issue using the concept of loosening task constraints. We provided
theoretical and empirical evidence that agents trained using LRS rewards
converge more slowly compared to pure RL agents. Our work highlights the
brittleness of existing LRS methods, which has been overlooked in previous
studies.
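To make the failure mode concrete, here is a minimal, hypothetical Python sketch (the subgoal names, trajectories, and reward functions below are illustrative assumptions, not the paper's implementation): a strict sparse reward pays out only when the full instruction is completed in order, whereas a loosened, LRS-style reward also credits trajectories that merely match a prefix of the instruction, so an agent can collect reward without ever solving the task.

```python
from typing import List

# Toy instruction: complete these subgoals in order (hypothetical example).
INSTRUCTION = ["get_wood", "craft_plank", "craft_stick"]


def matched_prefix(trajectory: List[str], instruction: List[str]) -> int:
    """Length of the longest instruction prefix completed in order along the trajectory."""
    i = 0
    for event in trajectory:
        if i < len(instruction) and event == instruction[i]:
            i += 1
    return i


def strict_reward(trajectory: List[str]) -> float:
    """Sparse task reward: +1 only when the whole instruction is satisfied in order."""
    return 1.0 if matched_prefix(trajectory, INSTRUCTION) == len(INSTRUCTION) else 0.0


def partial_match_lrs_reward(trajectory: List[str]) -> float:
    """LRS-style shaped reward that credits partially matched trajectories,
    i.e. a loosened task constraint: fraction of subgoals hit, regardless of completion."""
    return matched_prefix(trajectory, INSTRUCTION) / len(INSTRUCTION)


if __name__ == "__main__":
    # A trajectory that stalls after the first subgoal never solves the task,
    # yet still collects shaped reward under the loosened LRS-style design.
    stalled = ["get_wood", "wander", "wander"]
    solved = ["get_wood", "craft_plank", "craft_stick"]
    print("stalled:", strict_reward(stalled), partial_match_lrs_reward(stalled))
    print("solved: ", strict_reward(solved), partial_match_lrs_reward(solved))
```

Under such a loosened reward, return-maximising behaviour need not coincide with completing the instruction, which is the slow-convergence concern the abstract raises.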
Related papers
- Exploring RL-based LLM Training for Formal Language Tasks with Programmed Rewards [49.7719149179179]
This paper investigates the feasibility of using PPO for reinforcement learning (RL) from explicitly programmed reward signals.
We focus on tasks expressed through formal languages, such as programming, where explicit reward functions can be programmed to automatically assess the quality of generated outputs.
Our results show that pure RL-based training for the two formal language tasks is challenging, with success being limited even for the simple arithmetic task.
arXiv Detail & Related papers (2024-10-22T15:59:58Z) - LLMs Are In-Context Reinforcement Learners [30.192422586838997]
Large Language Models (LLMs) can learn new tasks through in-context supervised learning (i.e., ICL).
This work studies if this ability extends to in-context reinforcement learning (ICRL), where models are not given gold labels in context, but only their past predictions and rewards.
We propose an algorithm to address this deficiency by increasing test-time compute, as well as a compute-bound approximation.
arXiv Detail & Related papers (2024-10-07T17:45:00Z) - Towards Learning Abductive Reasoning using VSA Distributed Representations [56.31867341825068]
We introduce the Abductive Rule Learner with Context-awareness (ARLC) model.
ARLC features a novel and more broadly applicable training objective for abductive reasoning.
We show ARLC's robustness to post-programming training by incrementally learning from examples on top of programmed knowledge.
arXiv Detail & Related papers (2024-06-27T12:05:55Z) - FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning [18.60627708199452]
We investigate how to leverage pre-trained visual-language models (VLM) for online Reinforcement Learning (RL).
We first identify the problem of reward misalignment when applying VLM as a reward in RL tasks.
We introduce a lightweight fine-tuning method named Fuzzy VLM reward-aided RL (FuRL).
arXiv Detail & Related papers (2024-06-02T07:20:08Z) - RLSF: Reinforcement Learning via Symbolic Feedback [11.407319705797242]
We propose a new fine-tuning paradigm we refer to as Reinforcement Learning via Symbolic Feedback (RLSF).
In RLSF, the LLM being fine-tuned is considered an RL agent, while the environment is allowed access to reasoning or domain knowledge tools.
We show that our RLSF-based fine-tuning of LLMs outperforms traditional approaches on five different applications.
arXiv Detail & Related papers (2024-05-26T18:49:59Z) - Recall, Retrieve and Reason: Towards Better In-Context Relation Extraction [11.535892987373947]
Relation extraction (RE) aims to identify relations between entities mentioned in texts.
Large language models (LLMs) have demonstrated impressive in-context learning abilities in various tasks.
However, LLMs suffer from poor performance compared to most supervised fine-tuned RE methods.
arXiv Detail & Related papers (2024-04-27T07:12:52Z) - Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint [104.53687944498155]
Reinforcement learning (RL) has been widely used in training large language models (LLMs).
We propose a new RL method named RLMEC that incorporates a generative model as the reward model.
Based on the generative reward model, we design the token-level RL objective for training and an imitation-based regularization for stabilizing the RL process.
arXiv Detail & Related papers (2024-01-11T17:58:41Z) - Leveraging Reward Consistency for Interpretable Feature Discovery in
Reinforcement Learning [69.19840497497503]
It is argued that the commonly used action matching principle is more an explanation of the deep neural networks (DNNs) than an interpretation of the RL agents.
We propose instead to take rewards, the essential objective of RL agents, as the basis for interpreting RL agents.
We verify and evaluate our method on the Atari 2600 games as well as Duckietown, a challenging self-driving car simulator environment.
arXiv Detail & Related papers (2023-09-04T09:09:54Z) - Language Reward Modulation for Pretraining Reinforcement Learning [61.76572261146311]
We propose leveraging the capabilities of learned reward functions (LRFs) as a pretraining signal for reinforcement learning.
Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks.
arXiv Detail & Related papers (2023-08-23T17:37:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.