A Reminder of its Brittleness: Language Reward Shaping May Hinder
Learning for Instruction Following Agents
- URL: http://arxiv.org/abs/2305.16621v2
- Date: Thu, 17 Aug 2023 06:11:14 GMT
- Title: A Reminder of its Brittleness: Language Reward Shaping May Hinder
Learning for Instruction Following Agents
- Authors: Sukai Huang, Nir Lipovetzky and Trevor Cohn
- Abstract summary: We argue that the apparent success of LRS is brittle, and prior positive findings can be attributed to weak RL baselines.
We provided theoretical and empirical evidence that agents trained using LRS rewards converge more slowly than pure RL agents.
- Score: 38.928166383780535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Teaching agents to follow complex written instructions has been an important
yet elusive goal. One technique for enhancing learning efficiency is language
reward shaping (LRS). Within a reinforcement learning (RL) framework, LRS
involves training a reward function that rewards behaviours precisely aligned
with given language instructions. We argue that the apparent success of LRS is
brittle, and prior positive findings can be attributed to weak RL baselines.
Specifically, we identified suboptimal LRS designs that reward partially
matched trajectories, and we characterised a novel reward perturbation to
capture this issue using the concept of loosening task constraints. We provided
theoretical and empirical evidence that agents trained using LRS rewards converge more slowly than pure RL agents. Our work highlights the brittleness of existing LRS methods, which has been overlooked in previous studies.
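To make the failure mode concrete, below is a minimal Python sketch (not the authors' implementation; the toy task, subtask names, and matching rule are illustrative assumptions) contrasting a sparse task reward with an LRS-style reward that pays for partially matched trajectories, i.e. the loosened task constraints described in the abstract.

```python
# Minimal sketch, assuming a toy crafting task with an ordered
# instruction. Not from the paper; all names are hypothetical.
from typing import List

INSTRUCTION: List[str] = ["get wood", "make plank", "make stick"]

def sparse_reward(completed: List[str]) -> float:
    """Pure RL baseline: reward only on exact task completion."""
    return 1.0 if completed == INSTRUCTION else 0.0

def lrs_reward(completed: List[str]) -> float:
    """LRS-style shaping: fraction of instruction steps matched.

    It pays out for any matched step regardless of order or of
    whether the full task is still completable, which is one way
    partial-match rewards loosen the task constraints.
    """
    matched = sum(1 for step in completed if step in INSTRUCTION)
    return matched / len(INSTRUCTION)

# A trajectory that skips "make plank" can no longer produce the
# required ordered sequence, yet the shaped reward still pays 2/3:
print(sparse_reward(["get wood", "make stick"]))  # 0.0
print(lrs_reward(["get wood", "make stick"]))     # 0.666...
```

Under such a reward, an agent can keep accumulating shaping reward while drifting away from trajectories that actually satisfy the instruction, which is consistent with the slower convergence the paper reports for LRS-trained agents.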
Related papers
- Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains.
We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains.
Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z)
- Harnessing the Power of Reinforcement Learning for Language-Model-Based Information Retriever via Query-Document Co-Augmentation [35.70731674603417]
We present an LLM-based retriever empowered to augment both user queries and corpus documents.
Our approach significantly enhances LLM-based retrieval performance in both sparse and dense settings.
arXiv Detail & Related papers (2025-06-23T14:14:43Z)
- No Free Lunch: Rethinking Internal Feedback for LLM Reasoning [12.881043910316787]
Reinforcement learning has emerged as a powerful paradigm for post-training large language models (LLMs) to improve reasoning.
We investigate an alternative class of methods, Reinforcement Learning from Internal Feedback (RLIF), which relies solely on intrinsic model-derived signals instead of external rewards.
arXiv Detail & Related papers (2025-06-20T17:59:52Z)
- Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning [52.32193550674408]
We aim to improve the reasoning capabilities of language models via reinforcement learning (RL).
We propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually.
E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B).
arXiv Detail & Related papers (2025-06-07T02:41:54Z)
- Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models [26.401130750061323]
Chain-of-thought (CoT) is expected to universally improve the capabilities of large language models (LLMs).
We propose RAIF, a systematic method to boost LLMs in dealing with complex instructions via incentivizing reasoning for test-time compute scaling.
We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement.
arXiv Detail & Related papers (2025-06-02T08:11:44Z)
- SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data [65.56911325914582]
We propose Self-play Reinforcement Learning (SeRL) to bootstrap Large Language Model (LLM) training with limited initial data.
The proposed SeRL yields results superior to its counterparts and achieves performance on par with that obtained using high-quality data with verifiable rewards.
arXiv Detail & Related papers (2025-05-25T13:28:04Z)
- Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [23.99454995087634]
We explore the potential of rule-based reinforcement learning in large reasoning models.
We use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification.
Our 7B model develops advanced reasoning skills, such as reflection, verification, and summarization, that are absent from the logic corpus.
arXiv Detail & Related papers (2025-02-20T17:49:26Z)
- Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning [45.30569353687124]
We introduce LaRe, a novel LLM-empowered symbolic-based decision-making framework to improve credit assignment.
Key to LaRe is the concept of the Latent Reward, which works as a multi-dimensional performance evaluation.
LaRe (i) achieves superior temporal credit assignment compared to SOTA methods, (ii) excels in allocating contributions among multiple agents, and (iii) outperforms policies trained with ground-truth rewards on certain tasks.
arXiv Detail & Related papers (2024-12-15T08:51:14Z)
- Exploring RL-based LLM Training for Formal Language Tasks with Programmed Rewards [49.7719149179179]
This paper investigates the feasibility of using PPO for reinforcement learning (RL) from explicitly programmed reward signals.
We focus on tasks expressed through formal languages, such as programming, where explicit reward functions can be programmed to automatically assess the quality of generated outputs.
Our results show that pure RL-based training for the two formal language tasks is challenging, with success being limited even for the simple arithmetic task.
arXiv Detail & Related papers (2024-10-22T15:59:58Z)
- LLMs Are In-Context Reinforcement Learners [30.192422586838997]
Large Language Models (LLMs) can learn new tasks through in-context supervised learning (i.e., ICL).
This work studies if this ability extends to in-context reinforcement learning (ICRL), where models are not given gold labels in context, but only their past predictions and rewards.
We propose an algorithm to address this deficiency by increasing test-time compute, as well as a compute-bound approximation.
arXiv Detail & Related papers (2024-10-07T17:45:00Z)
- Towards Learning Abductive Reasoning using VSA Distributed Representations [56.31867341825068]
We introduce the Abductive Rule Learner with Context-awareness (ARLC) model.
ARLC features a novel and more broadly applicable training objective for abductive reasoning.
We show ARLC's robustness to post-programming training by incrementally learning from examples on top of programmed knowledge.
arXiv Detail & Related papers (2024-06-27T12:05:55Z)
- FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning [18.60627708199452]
We investigate how to leverage pre-trained visual-language models (VLMs) for online Reinforcement Learning (RL).
We first identify the problem of reward misalignment when applying VLM as a reward in RL tasks.
We introduce a lightweight fine-tuning method, named Fuzzy VLM reward-aided RL (FuRL).
arXiv Detail & Related papers (2024-06-02T07:20:08Z)
- RLSF: Reinforcement Learning via Symbolic Feedback [11.407319705797242]
We propose a new fine-tuning paradigm we refer to as Reinforcement Learning via Symbolic Feedback (RLSF).
In RLSF, the LLM being fine-tuned is considered an RL agent, while the environment is allowed access to reasoning or domain knowledge tools.
We show that our RLSF-based fine-tuning of LLMs outperforms traditional approaches on five different applications.
arXiv Detail & Related papers (2024-05-26T18:49:59Z)
- Recall, Retrieve and Reason: Towards Better In-Context Relation Extraction [11.535892987373947]
Relation extraction (RE) aims to identify relations between entities mentioned in texts.
Large language models (LLMs) have demonstrated impressive in-context learning abilities in various tasks.
However, LLMs suffer from poor performance compared to most supervised fine-tuned RE methods.
arXiv Detail & Related papers (2024-04-27T07:12:52Z)
- Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint [104.53687944498155]
Reinforcement learning (RL) has been widely used in training large language models (LLMs).
We propose a new RL method named RLMEC that incorporates a generative model as the reward model.
Based on the generative reward model, we design a token-level RL objective for training and an imitation-based regularization for stabilizing the RL process.
arXiv Detail & Related papers (2024-01-11T17:58:41Z)
- Leveraging Reward Consistency for Interpretable Feature Discovery in Reinforcement Learning [69.19840497497503]
It is argued that the commonly used action matching principle is more like an explanation of deep neural networks (DNNs) than the interpretation of RL agents.
We propose instead to ground the interpretation of RL agents in rewards, their essential objective.
We verify and evaluate our method on the Atari 2600 games as well as Duckietown, a challenging self-driving car simulator environment.
arXiv Detail & Related papers (2023-09-04T09:09:54Z)
- Language Reward Modulation for Pretraining Reinforcement Learning [61.76572261146311]
We propose leveraging the capabilities of language reward functions (LRFs) as a pretraining signal for reinforcement learning.
Our VLM pretraining approach, a departure from previous attempts to use LRFs, can warm-start sample-efficient learning on robot manipulation tasks.
arXiv Detail & Related papers (2023-08-23T17:37:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.