Online Preference-based Reinforcement Learning with Self-augmented Feedback from Large Language Model
- URL: http://arxiv.org/abs/2412.16878v1
- Date: Sun, 22 Dec 2024 06:15:25 GMT
- Title: Online Preference-based Reinforcement Learning with Self-augmented Feedback from Large Language Model
- Authors: Songjun Tu, Jingbo Sun, Qichao Zhang, Xiangyuan Lan, Dongbin Zhao,
- Abstract summary: Preference-based reinforcement learning (PbRL) provides a powerful paradigm to avoid meticulous reward engineering by learning rewards based on human preferences.
In this paper, we propose a RL Self-augmented Large Language Model Feedback (RL-SaLLM-F) technique that does not rely on privileged information for online PbRL.
- Score: 17.4036850872656
- License:
- Abstract: Preference-based reinforcement learning (PbRL) provides a powerful paradigm to avoid meticulous reward engineering by learning rewards based on human preferences. However, real-time human feedback is hard to obtain in online tasks. Most work suppose there is a "scripted teacher" that utilizes privileged predefined reward to provide preference feedback. In this paper, we propose a RL Self-augmented Large Language Model Feedback (RL-SaLLM-F) technique that does not rely on privileged information for online PbRL. RL-SaLLM-F leverages the reflective and discriminative capabilities of LLM to generate self-augmented trajectories and provide preference labels for reward learning. First, we identify an failure issue in LLM-based preference discrimination, specifically "query ambiguity", in online PbRL. Then LLM is employed to provide preference labels and generate self-augmented imagined trajectories that better achieve the task goal, thereby enhancing the quality and efficiency of feedback. Additionally, a double-check mechanism is introduced to mitigate randomness in the preference labels, improving the reliability of LLM feedback. The experiment across multiple tasks in the MetaWorld benchmark demonstrates the specific contributions of each proposed module in RL-SaLLM-F, and shows that self-augmented LLM feedback can effectively replace the impractical "scripted teacher" feedback. In summary, RL-SaLLM-F introduces a new direction of feedback acquisition in online PbRL that does not rely on any online privileged information, offering an efficient and lightweight solution with LLM-driven feedback.
Related papers
- Large Language Model driven Policy Exploration for Recommender Systems [50.70228564385797]
offline RL policies trained on static user data are vulnerable to distribution shift when deployed in dynamic online environments.
Online RL-based RS also face challenges in production deployment due to the risks of exposing users to untrained or unstable policies.
Large Language Models (LLMs) offer a promising solution to mimic user objectives and preferences for pre-training policies offline.
We propose an Interaction-Augmented Learned Policy (iALP) that utilizes user preferences distilled from an LLM.
arXiv Detail & Related papers (2025-01-23T16:37:44Z) - Listwise Reward Estimation for Offline Preference-based Reinforcement Learning [20.151932308777553]
Listwise Reward Estimation (LiRE) is a novel approach for offline Preference-based Reinforcement Learning (PbRL)
LiRE builds on existing PbRL methods by constructing a Ranked List of Trajectories (RLT)
Our experiments demonstrate the superiority of LiRE, even with modest feedback budgets and enjoying robustness with respect to the number of feedbacks and feedback noise.
arXiv Detail & Related papers (2024-08-08T03:18:42Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - RLSF: Reinforcement Learning via Symbolic Feedback [11.407319705797242]
We propose a new fine-tuning paradigm we refer to as Reinforcement Learning via proofs Feedback (RLSF)
In RLSF, the LLM being fine-tuned is considered an RL agent, while the environment is allowed access to reasoning or domain knowledge tools.
We show that our RLSF-based fine-tuning of LLMs outperforms traditional approaches on five different applications.
arXiv Detail & Related papers (2024-05-26T18:49:59Z) - Reinforcement Learning from Reflective Feedback (RLRF): Aligning and Improving LLMs via Fine-Grained Self-Reflection [24.435121488662897]
We propose a novel framework: Reinforcement Learning from Reflective Feedback (RLRF)
RLRF employs a self-reflection mechanism to systematically explore and refine LLM responses, then fine-tuning the models via a RL algorithm along with promising responses.
Our experiments across Just-Eval, Factuality, and Mathematical Reasoning demonstrate the efficacy and transformative potential of RLRF.
arXiv Detail & Related papers (2024-03-21T08:57:27Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Reflect-RL: Two-Player Online RL Fine-Tuning for LMs [38.5495318990769]
We propose Reflect-RL, a system to fine-tune language models (LMs) using online reinforcement learning (RL) and supervised fine-tuning (SFT)
Test results indicate GPT-2 XL 1.56B fine-tuned with Reflect-RL outperforms larger open-source LMs, such as Mistral 7B.
arXiv Detail & Related papers (2024-02-20T01:04:21Z) - Direct Language Model Alignment from Online AI Feedback [78.40436231613754]
Direct alignment from preferences (DAP) methods have recently emerged as efficient alternatives to reinforcement learning from human feedback (RLHF)
In this study, we posit that online feedback is key and improves DAP methods.
Our method, online AI feedback (OAIF) uses an LLM as annotator: on each training, we sample two responses from the current model and prompt the LLM annotator to choose which one is preferred, thus providing online feedback.
arXiv Detail & Related papers (2024-02-07T12:31:13Z) - Reinforcement Learning from LLM Feedback to Counteract Goal
Misgeneralization [0.0]
We introduce a method to address goal misgeneralization in reinforcement learning (RL)
Goal misgeneralization occurs when an agent retains its capabilities out-of-distribution yet pursues a proxy rather than the intended one.
This study demonstrates how the Large Language Model can efficiently supervise RL agents.
arXiv Detail & Related papers (2024-01-14T01:09:48Z) - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN)
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z) - Reinforced Self-Training (ReST) for Language Modeling [56.75447441157628]
Reinforcement learning from human feedback (RLHF) can improve the quality of large language model's (LLM) outputs by aligning them with human preferences.
We propose a simple algorithm for aligning LLMs with human preferences inspired by growing batch reinforcement learning (RL), which we call Reinforced Self-Training (ReST)
Our results show that ReST can substantially improve translation quality, as measured by automated metrics and human evaluation on machine translation benchmarks in a compute and sample-efficient manner.
arXiv Detail & Related papers (2023-08-17T14:12:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.