A Reminder of its Brittleness: Language Reward Shaping May Hinder
Learning for Instruction Following Agents
- URL: http://arxiv.org/abs/2305.16621v2
- Date: Thu, 17 Aug 2023 06:11:14 GMT
- Title: A Reminder of its Brittleness: Language Reward Shaping May Hinder
Learning for Instruction Following Agents
- Authors: Sukai Huang, Nir Lipovetzky and Trevor Cohn
- Abstract summary: We argue that the apparent success of LRS is brittle, and prior positive findings can be attributed to weak RL baselines.
We provided theoretical and empirical evidence that agents trained using LRS rewards converge more slowly than pure RL agents.
- Score: 38.928166383780535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Teaching agents to follow complex written instructions has been an important
yet elusive goal. One technique for enhancing learning efficiency is language
reward shaping (LRS). Within a reinforcement learning (RL) framework, LRS
involves training a reward function that rewards behaviours precisely aligned
with given language instructions. We argue that the apparent success of LRS is
brittle, and prior positive findings can be attributed to weak RL baselines.
Specifically, we identified suboptimal LRS designs that reward partially
matched trajectories, and we characterised a novel reward perturbation to
capture this issue using the concept of loosening task constraints. We provided
theoretical and empirical evidence that agents trained using LRS rewards converge more slowly than pure RL agents. Our work highlights the brittleness of existing LRS methods, which has been overlooked in previous studies.
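To make the failure mode concrete, below is a minimal Python sketch (not the authors' implementation; the toy task, subtask names, and matching rule are illustrative assumptions) contrasting a sparse task reward with an LRS-style reward that pays for partially matched trajectories, i.e. the loosened task constraints described in the abstract.

```python
# Minimal sketch, assuming a toy crafting task with an ordered
# instruction. Not from the paper; all names are hypothetical.
from typing import List

INSTRUCTION: List[str] = ["get wood", "make plank", "make stick"]

def sparse_reward(completed: List[str]) -> float:
    """Pure RL baseline: reward only on exact task completion."""
    return 1.0 if completed == INSTRUCTION else 0.0

def lrs_reward(completed: List[str]) -> float:
    """LRS-style shaping: fraction of instruction steps matched.

    It pays out for any matched step regardless of order or of
    whether the full task is still completable, which is one way
    partial-match rewards loosen the task constraints.
    """
    matched = sum(1 for step in completed if step in INSTRUCTION)
    return matched / len(INSTRUCTION)

# A trajectory that skips "make plank" can no longer produce the
# required ordered sequence, yet the shaped reward still pays 2/3:
print(sparse_reward(["get wood", "make stick"]))  # 0.0
print(lrs_reward(["get wood", "make stick"]))     # 0.666...
```

Under such a reward, an agent can keep accumulating shaping reward while drifting away from trajectories that actually satisfy the instruction, which is consistent with the slower convergence the paper reports for LRS-trained agents.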
Related papers
- Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains.
We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains.
Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z)
- Harnessing the Power of Reinforcement Learning for Language-Model-Based Information Retriever via Query-Document Co-Augmentation [35.70731674603417]
We present an LLM-based retriever empowered to augment both user queries and corpus documents.
Our approach significantly enhances LLM-based retrieval performance in both sparse and dense settings.
arXiv Detail & Related papers (2025-06-23T14:14:43Z)
- No Free Lunch: Rethinking Internal Feedback for LLM Reasoning [12.881043910316787]
Reinforcement learning has emerged as a powerful paradigm for post-training large language models (LLMs) to improve reasoning.
We investigate an alternative class of methods, Reinforcement Learning from Internal Feedback (RLIF), which relies solely on intrinsic model-derived signals instead of external rewards.
arXiv Detail & Related papers (2025-06-20T17:59:52Z)
- Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning [52.32193550674408]
We aim to improve the reasoning capabilities of language models via reinforcement learning (RL).
We propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually.
E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B).
arXiv Detail & Related papers (2025-06-07T02:41:54Z)
- Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models [26.401130750061323]
Chain-of-thought (CoT) is expected to universally improve the capabilities of large language models (LLMs).
We propose RAIF, a systematic method to boost LLMs in dealing with complex instructions via incentivizing reasoning for test-time compute scaling.
We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement.
arXiv Detail & Related papers (2025-06-02T08:11:44Z)
- SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data [65.56911325914582]
We propose Self-play Reinforcement Learning (SeRL) to bootstrap Large Language Model (LLM) training with limited initial data.
The proposed SeRL yields results superior to its counterparts and achieves performance on par with that obtained using high-quality data with verifiable rewards.
arXiv Detail & Related papers (2025-05-25T13:28:04Z)
- Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [23.99454995087634]
We explore the potential of rule-based reinforcement learning in large reasoning models.
We use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification.
Our 7B model develops advanced reasoning skills, such as reflection, verification, and summarization, that are absent from the logic corpus.
arXiv Detail & Related papers (2025-02-20T17:49:26Z)
- Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning [45.30569353687124]
We introduce LaRe, a novel LLM-empowered symbolic-based decision-making framework to improve credit assignment.
Key to LaRe is the concept of the Latent Reward, which works as a multi-dimensional performance evaluation.
LaRe (i) achieves superior temporal credit assignment compared to SOTA methods, (ii) excels in allocating contributions among multiple agents, and (iii) outperforms policies trained with ground-truth rewards on certain tasks.
arXiv Detail & Related papers (2024-12-15T08:51:14Z)
- Exploring RL-based LLM Training for Formal Language Tasks with Programmed Rewards [49.7719149179179]
This paper investigates the feasibility of using PPO for reinforcement learning (RL) from explicitly programmed reward signals.
We focus on tasks expressed through formal languages, such as programming, where explicit reward functions can be programmed to automatically assess the quality of generated outputs.
Our results show that pure RL-based training for the two formal language tasks is challenging, with success being limited even for the simple arithmetic task.
arXiv Detail & Related papers (2024-10-22T15:59:58Z)
- LLMs Are In-Context Reinforcement Learners [30.192422586838997]
Large Language Models (LLMs) can learn new tasks through in-context supervised learning (i.e., ICL).
This work studies if this ability extends to in-context reinforcement learning (ICRL), where models are not given gold labels in context, but only their past predictions and rewards.
We propose an algorithm to address this deficiency by increasing test-time compute, as well as a compute-bound approximation.
arXiv Detail & Related papers (2024-10-07T17:45:00Z)
- Towards Learning Abductive Reasoning using VSA Distributed Representations [56.31867341825068]
We introduce the Abductive Rule Learner with Context-awareness (ARLC) model.
ARLC features a novel and more broadly applicable training objective for abductive reasoning.
We show ARLC's robustness to post-programming training by incrementally learning from examples on top of programmed knowledge.
arXiv Detail & Related papers (2024-06-27T12:05:55Z)
- FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning [18.60627708199452]
We investigate how to leverage pre-trained visual-language models (VLMs) for online Reinforcement Learning (RL).
We first identify the problem of reward misalignment when applying VLM as a reward in RL tasks.
We introduce a lightweight fine-tuning method, named Fuzzy VLM reward-aided RL (FuRL).
arXiv Detail & Related papers (2024-06-02T07:20:08Z)
- RLSF: Reinforcement Learning via Symbolic Feedback [11.407319705797242]
We propose a new fine-tuning paradigm we refer to as Reinforcement Learning via Symbolic Feedback (RLSF).
In RLSF, the LLM being fine-tuned is considered an RL agent, while the environment is allowed access to reasoning or domain knowledge tools.
We show that our RLSF-based fine-tuning of LLMs outperforms traditional approaches on five different applications.
arXiv Detail & Related papers (2024-05-26T18:49:59Z)
- Recall, Retrieve and Reason: Towards Better In-Context Relation Extraction [11.535892987373947]
Relation extraction (RE) aims to identify relations between entities mentioned in texts.
Large language models (LLMs) have demonstrated impressive in-context learning abilities in various tasks.
However, LLMs suffer from poor performance compared to most supervised fine-tuned RE methods.
arXiv Detail & Related papers (2024-04-27T07:12:52Z)
- Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint [104.53687944498155]
Reinforcement learning (RL) has been widely used in training large language models (LLMs).
We propose a new RL method named RLMEC that incorporates a generative model as the reward model.
Based on the generative reward model, we design a token-level RL objective for training and an imitation-based regularization for stabilizing the RL process.
arXiv Detail & Related papers (2024-01-11T17:58:41Z)
- Leveraging Reward Consistency for Interpretable Feature Discovery in Reinforcement Learning [69.19840497497503]
It is argued that the commonly used action matching principle is more like an explanation of deep neural networks (DNNs) than the interpretation of RL agents.
We propose instead to ground the interpretation of RL agents in rewards, their essential objective.
We verify and evaluate our method on the Atari 2600 games as well as Duckietown, a challenging self-driving car simulator environment.
arXiv Detail & Related papers (2023-09-04T09:09:54Z)
- Language Reward Modulation for Pretraining Reinforcement Learning [61.76572261146311]
We propose leveraging the capabilities of language reward functions (LRFs) as a pretraining signal for reinforcement learning.
Our VLM pretraining approach, a departure from previous attempts to use LRFs, can warm-start sample-efficient learning on robot manipulation tasks.
arXiv Detail & Related papers (2023-08-23T17:37:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.