A Large Language Model-Driven Reward Design Framework via Dynamic Feedback for Reinforcement Learning
- URL: http://arxiv.org/abs/2410.14660v1
- Date: Fri, 18 Oct 2024 17:51:51 GMT
- Title: A Large Language Model-Driven Reward Design Framework via Dynamic Feedback for Reinforcement Learning
- Authors: Shengjie Sun, Runze Liu, Jiafei Lyu, Jing-Wen Yang, Liangpeng Zhang, Xiu Li
- Abstract summary: CARD is a Reward Design framework that iteratively generates and improves reward function code.
CARD includes a Coder that generates and verifies the code, while an Evaluator provides dynamic feedback to guide the Coder in improving the code.
- Score: 25.82540393199001
- License:
- Abstract: Large Language Models (LLMs) have shown significant potential in designing reward functions for Reinforcement Learning (RL) tasks. However, obtaining high-quality reward code often involves human intervention, numerous LLM queries, or repetitive RL training. To address these issues, we propose CARD, an LLM-driven Reward Design framework that iteratively generates and improves reward function code. Specifically, CARD includes a Coder that generates and verifies the code, while an Evaluator provides dynamic feedback to guide the Coder in improving the code, eliminating the need for human feedback. In addition to process feedback and trajectory feedback, we introduce Trajectory Preference Evaluation (TPE), which evaluates the current reward function based on trajectory preferences. If the code fails the TPE, the Evaluator provides preference feedback, avoiding RL training at every iteration and making the reward function better aligned with the task objective. Empirical results on Meta-World and ManiSkill2 demonstrate that our method achieves an effective balance between task performance and token efficiency, outperforming or matching the baselines across all tasks. On 10 out of 12 tasks, CARD shows better or comparable performance to policies trained with expert-designed rewards, and our method even surpasses the oracle on 3 tasks.
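The abstract describes an iterative Coder/Evaluator loop. The sketch below illustrates that loop under stated assumptions; the callables it takes (coder, verify_code, rollout, summarize, passes_tpe) are hypothetical stand-ins for the paper's components, not the authors' actual interfaces.

```python
# Minimal sketch of a CARD-style Coder/Evaluator loop, as described in the
# abstract. All injected callables are hypothetical stand-ins.

def design_reward(task, coder, verify_code, rollout, summarize, passes_tpe,
                  max_iters=5):
    """Iteratively generate and refine reward-function code without per-iteration RL training."""
    feedback, reward_code = "", None
    for _ in range(max_iters):
        # Coder: generate (or refine) reward code from the task plus the latest feedback.
        reward_code = coder(task, feedback)

        # Process feedback: does the generated code even run on a dummy transition?
        ok, msg = verify_code(reward_code)
        if not ok:
            feedback = f"process feedback: {msg}"
            continue

        # Trajectory feedback: cheap rollouts under the candidate reward,
        # summarized for the Evaluator (no RL training at every iteration).
        trajectories = rollout(reward_code, n_episodes=10)

        # Trajectory Preference Evaluation (TPE): does the candidate reward rank
        # trajectories consistently with task-level preferences?
        if passes_tpe(reward_code, trajectories):
            return reward_code  # hand off to a single RL training run

        # Evaluator: return trajectory plus preference feedback to the Coder.
        feedback = ("trajectory feedback: " + summarize(trajectories) +
                    "\npreference feedback: reward ranking disagrees with trajectory preferences")
    return reward_code
```

Because TPE is checked against trajectory preferences rather than full RL runs, the loop only hands off to RL training once a candidate reward passes the check.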
Related papers
- REvolve: Reward Evolution with Large Language Models using Human Feedback [6.4550546442058225]
Large language models (LLMs) have been used for reward generation from natural language task descriptions.
LLMs, guided by human feedback, can be used to formulate reward functions that reflect human implicit knowledge.
We introduce REvolve, a truly evolutionary framework that uses LLMs for reward design in reinforcement learning.
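As a rough illustration of the evolutionary loop this entry describes, here is a minimal sketch assuming an LLM proposer and human-feedback fitness scores; the selection and mutation scheme is an illustrative assumption, not REvolve's actual algorithm.

```python
# Hedged sketch of an evolutionary reward-design loop in the spirit of the
# REvolve summary above: an LLM proposes reward functions, human feedback
# supplies fitness, and the best candidates seed the next generation.
import random

def evolve_rewards(task, llm_propose, human_fitness,
                   population_size=8, generations=5, n_parents=2):
    # Initial population of reward-function code strings proposed by the LLM.
    population = [llm_propose(task, parent=None) for _ in range(population_size)]
    for _ in range(generations):
        # Human feedback acts as the fitness function (in practice, cache these scores).
        ranked = sorted(population, key=human_fitness, reverse=True)
        parents = ranked[:n_parents]
        # Next generation: keep the parents, let the LLM "mutate" them.
        children = [llm_propose(task, parent=random.choice(parents))
                    for _ in range(population_size - n_parents)]
        population = parents + children
    return max(population, key=human_fitness)
```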
arXiv Detail & Related papers (2024-06-03T13:23:27Z)
- Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning [7.07264650720021]
Sub-optimal Data Pre-training (SDP) is an approach that leverages reward-free, sub-optimal data to improve HitL RL algorithms.
We show SDP can significantly improve or achieve competitive performance with state-of-the-art HitL RL algorithms.
arXiv Detail & Related papers (2024-04-30T18:58:33Z)
- RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback [24.759613248409167]
Reward engineering has long been a challenge in Reinforcement Learning research.
We propose RL-VLM-F, a method that automatically generates reward functions for agents to learn new tasks.
We demonstrate that RL-VLM-F successfully produces effective rewards and policies across various domains.
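A hedged sketch of the kind of pipeline this entry suggests: a vision-language model labels preferences between pairs of observations given the task description, and a reward model is fit to those labels. The query_vlm_preference helper and the Bradley-Terry-style objective are assumptions, not necessarily the paper's exact formulation.

```python
# Sketch of preference-based reward learning with VLM labels, in the spirit of
# the RL-VLM-F summary above. Helper names are illustrative assumptions.
import torch
import torch.nn.functional as F

def fit_reward_from_vlm_preferences(reward_net, obs_pairs, task_description,
                                    query_vlm_preference, epochs=10, lr=1e-3):
    """reward_net maps an observation tensor to a scalar reward."""
    # 1) Ask the VLM which observation in each pair better satisfies the task
    #    (0 = first observation preferred, 1 = second).
    labels = torch.tensor([query_vlm_preference(o1, o2, task_description)
                           for o1, o2 in obs_pairs], dtype=torch.long)
    opt = torch.optim.Adam(reward_net.parameters(), lr=lr)
    for _ in range(epochs):
        r1 = torch.stack([reward_net(o1) for o1, _ in obs_pairs]).squeeze(-1)
        r2 = torch.stack([reward_net(o2) for _, o2 in obs_pairs]).squeeze(-1)
        # 2) Bradley-Terry-style loss: the preferred observation should score higher.
        logits = torch.stack([r1, r2], dim=-1)
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return reward_net
```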
arXiv Detail & Related papers (2024-02-06T04:06:06Z)
- Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
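A minimal sketch of the redistribution idea summarized above, assuming per-token attention mass has already been extracted from the reward model; the aggregation scheme is an illustrative assumption.

```python
# Spread the reward model's scalar output over completion tokens in proportion
# to attention weights, as the summary above describes.
import numpy as np

def redistribute_reward(scalar_reward, attention_over_completion):
    """attention_over_completion: per-token attention mass, shape (T,)."""
    attn = np.asarray(attention_over_completion, dtype=np.float64)
    weights = attn / attn.sum()           # normalize to a distribution
    return scalar_reward * weights        # shape (T,), sums to scalar_reward

# Example: a 4-token completion with scalar reward 2.0 from the reward model.
print(redistribute_reward(2.0, [0.1, 0.4, 0.3, 0.2]))  # -> [0.2 0.8 0.6 0.4]
```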
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
- Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint [104.53687944498155]
Reinforcement learning (RL) has been widely used in training large language models (LLMs).
We propose a new RL method named RLMEC that incorporates a generative model as the reward model.
Based on the generative reward model, we design the token-level RL objective for training and an imitation-based regularization for stabilizing the RL process.
arXiv Detail & Related papers (2024-01-11T17:58:41Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft [88.80684763462384]
This paper introduces an advanced learning system, named Auto MC-Reward, that leverages Large Language Models (LLMs) to automatically design dense reward functions.
Experiments demonstrate a significant improvement in the success rate and learning efficiency of our agents in complex tasks in Minecraft.
arXiv Detail & Related papers (2023-12-14T18:58:12Z)
- Iterative Reward Shaping using Human Feedback for Correcting Reward Misspecification [15.453123084827089]
ITERS is an iterative reward shaping approach using human feedback for mitigating the effects of a misspecified reward function.
We evaluate ITERS in three environments and show that it can successfully correct misspecified reward functions.
arXiv Detail & Related papers (2023-08-30T11:45:40Z)
- Reward Design with Language Models [27.24197025688919]
Reward design in reinforcement learning (RL) is challenging since specifying human notions of desired behavior may be difficult via reward functions or require expert demonstrations.
Can we instead cheaply design rewards using a natural language interface?
This paper explores how to simplify reward design by prompting a large language model (LLM) such as GPT-3 as a proxy reward function.
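A minimal sketch of using an LLM as a proxy reward function, as this entry describes: prompt the model with the task objective and a textual episode summary, then parse the answer into a binary reward. The prompt format and the llm_complete helper are illustrative assumptions.

```python
# Sketch of an LLM-as-proxy-reward query, in the spirit of the summary above.

def llm_proxy_reward(task_objective, episode_summary, llm_complete):
    prompt = (
        f"Task objective: {task_objective}\n"
        f"Episode summary: {episode_summary}\n"
        "Did the agent's behavior satisfy the objective? Answer yes or no."
    )
    answer = llm_complete(prompt).strip().lower()
    # Binary reward parsed from the model's natural-language judgment.
    return 1.0 if answer.startswith("yes") else 0.0
```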
arXiv Detail & Related papers (2023-02-27T22:09:35Z)
- Fantastic Rewards and How to Tame Them: A Case Study on Reward Learning for Task-oriented Dialogue Systems [111.80916118530398]
Reinforcement learning (RL) techniques can naturally be utilized to train dialogue strategies to achieve user-specific goals.
This paper aims at answering the question of how to efficiently learn and leverage a reward function for training end-to-end (E2E) ToD agents.
arXiv Detail & Related papers (2023-02-20T22:10:04Z)
- Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible.
In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types.
We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.