Zero-Shot LLMs in Human-in-the-Loop RL: Replacing Human Feedback for Reward Shaping
- URL: http://arxiv.org/abs/2503.22723v1
- Date: Wed, 26 Mar 2025 03:17:12 GMT
- Title: Zero-Shot LLMs in Human-in-the-Loop RL: Replacing Human Feedback for Reward Shaping
- Authors: Mohammad Saif Nazir, Chayan Banerjee
- Abstract summary: Reinforcement learning often faces challenges with reward misalignment. Human-in-the-loop (HIL) methods may exacerbate the problem, as humans are prone to biases that lead to inconsistent, subjective, or misaligned feedback.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Reinforcement learning often faces challenges with reward misalignment, where agents optimize for given rewards but fail to exhibit the desired behaviors. This occurs when the reward function incentivizes proxy behaviors that diverge from the true objective. While human-in-the-loop (HIL) methods can help, they may exacerbate the problem, as humans are prone to biases that lead to inconsistent, subjective, or misaligned feedback, complicating the learning process. To address these issues, we propose two key contributions. First, we extend the use of zero-shot, off-the-shelf large language models (LLMs) for reward shaping beyond natural language processing (NLP) to continuous control tasks. By leveraging LLMs as direct feedback providers, we replace surrogate models trained on human feedback, which often suffer from the biases inherent in the feedback data they are trained on. Second, we introduce a hybrid framework (LLM-HFBF) that enables LLMs to identify and correct biases in human feedback while incorporating this feedback into the reward shaping process. The LLM-HFBF framework creates a more balanced and reliable system by addressing both the limitations of LLMs (e.g., lack of domain-specific knowledge) and of human supervision (e.g., inherent biases). By enabling human feedback bias flagging and correction, our approach improves reinforcement learning performance and reduces reliance on potentially biased human guidance. Empirical experiments show that biased human feedback significantly reduces performance, with average episodic reward (AER) dropping from 28.472 (unbiased approaches) to 7.039 (biased with conservative bias). In contrast, LLM-based approaches maintain an AER comparable to unbiased feedback, even in custom edge-case scenarios.
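The feedback loop described in the abstract can be made concrete with a short sketch. The code below is a minimal, hypothetical illustration, not the authors' implementation: the environment choice (Pendulum-v1), the prompt wording, the 0.1 shaping weight, and the helpers `query_llm` and `reconcile_feedback` are assumptions introduced here to show how a zero-shot LLM could score transitions directly and how an LLM-HFBF-style check could flag and override divergent human feedback.

```python
# Hypothetical sketch of (1) zero-shot LLM reward shaping for a continuous
# control task and (2) an LLM-HFBF-style bias check on human feedback.
# All names, prompts, and constants here are illustrative assumptions.
import gymnasium as gym

def query_llm(prompt: str) -> float:
    """Stand-in for a zero-shot, off-the-shelf LLM call that maps a textual
    transition description to a scalar shaping score in [-1, 1]."""
    return 0.0  # neutral placeholder so the sketch runs; replace with a real client

def describe_transition(obs, action, next_obs) -> str:
    # Serialize the continuous state/action so the LLM can judge progress
    # toward the true objective in natural language.
    return (f"state={obs.tolist()}, action={action.tolist()}, "
            f"next_state={next_obs.tolist()}. "
            "Score how well this step advances the task, from -1 to 1.")

def reconcile_feedback(human_score: float, llm_score: float,
                       tol: float = 0.5) -> float:
    # LLM-HFBF-style flagging: if human feedback diverges sharply from the
    # LLM's judgment, treat it as potentially biased and fall back to the
    # LLM score; otherwise keep the human signal.
    return llm_score if abs(human_score - llm_score) > tol else human_score

env = gym.make("Pendulum-v1")  # any continuous-control benchmark
obs, _ = env.reset(seed=0)
for _ in range(200):
    action = env.action_space.sample()  # stand-in for the learning policy
    next_obs, env_reward, terminated, truncated, _ = env.step(action)
    llm_score = query_llm(describe_transition(obs, action, next_obs))
    human_score = 0.0                   # placeholder human rating, if available
    shaping = reconcile_feedback(human_score, llm_score)
    shaped_reward = float(env_reward) + 0.1 * shaping  # fed to the RL update
    obs = next_obs if not (terminated or truncated) else env.reset()[0]
```

In practice the random actions would be replaced by the learning policy and the placeholder `query_llm` by a real LLM call; the reconciliation threshold `tol` governs how aggressively human feedback is overridden when it disagrees with the LLM.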
Related papers
- Intent-Aware Self-Correction for Mitigating Social Biases in Large Language Models [38.1620443730172]
Self-Correction based on feedback improves the output quality of Large Language Models (LLMs).
In this study, we demonstrate that clarifying intentions is essential for effectively reducing biases in LLMs through Self-Correction.
arXiv Detail & Related papers (2025-03-08T02:20:43Z) - Linear Probe Penalties Reduce LLM Sycophancy [3.6490659260835234]
Large language models (LLMs) are often sycophantic, prioritizing agreement with their users over accurate or objective statements.
This problematic behavior becomes more pronounced during reinforcement learning from human feedback (RLHF).
We develop a linear probing method to identify and penalize markers of sycophancy within the reward model, producing rewards that discourage sycophantic behavior.
arXiv Detail & Related papers (2024-12-01T21:11:28Z) - Post-hoc Reward Calibration: A Case Study on Length Bias [28.266675778940133]
Reward models (RMs) can develop biases by exploiting spurious correlations in their training data.
These biases can lead to incorrect output rankings, sub-optimal model evaluations, and the amplification of undesirable behaviours.
This paper addresses the challenge of correcting such biases without additional data and training.
arXiv Detail & Related papers (2024-09-25T22:30:42Z) - Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback [8.601283886845664]
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) with human intentions and values.
Despite its effectiveness and popularity, RLHF is prone to biased local optimization.
We propose a novel sequence-to-sequence (seq2seq) reward modeling method.
arXiv Detail & Related papers (2024-08-30T16:14:35Z) - Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs [6.090496490133132]
We propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a novel approach for bias mitigation replacing human feedback in traditional RLHF.
We utilize LLMs in multi-role debates to create a dataset that includes both high-bias and low-bias instances for training the reward model in reinforcement learning.
arXiv Detail & Related papers (2024-04-15T22:18:50Z) - ODIN: Disentangled Reward Mitigates Hacking in RLHF [127.35607931337019]
We study the issue of reward hacking on the response length, a challenge emerging in Reinforcement Learning from Human Feedback.
A well-formatted, verbose but less helpful response from the LLMs can often deceive LLMs or even human evaluators to achieve high scores.
Our approach almost eliminates the reward correlation with length, and improves the obtained policy by a significant margin.
arXiv Detail & Related papers (2024-02-11T22:40:12Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world. Recent methods aim to mitigate misalignment by learning reward functions from human preferences. We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback [61.28463542324576]
We present DRESS, a large vision language model (LVLM) that innovatively exploits Natural Language feedback (NLF) from Large Language Models.
We propose a novel categorization of the NLF into two key types: critique and refinement.
Our experimental results demonstrate that DRESS can generate more helpful (9.76%), honest (11.52%), and harmless (21.03%) responses.
arXiv Detail & Related papers (2023-11-16T18:37:29Z) - Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z) - Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods (the standard DPO objective is reproduced after this list for reference).
arXiv Detail & Related papers (2023-05-29T17:57:46Z) - Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward by measuring the novelty based on learned reward.
Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z) - PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z)
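For the Direct Preference Optimization entry above, the closed-form reward reparameterization it refers to is usually written as the objective below (the standard published DPO formulation, reproduced here for context): $\pi_\theta$ is the policy being tuned, $\pi_{\mathrm{ref}}$ the reference model, $\beta$ a temperature, and $(x, y_w, y_l)$ a prompt with preferred and dispreferred responses.

```latex
% DPO objective: the reward model is implicitly parameterized by the policy,
% so preference fitting reduces to a single classification-style loss.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```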
This list is automatically generated from the titles and abstracts of the papers on this site.