Language Models Can Learn from Verbal Feedback Without Scalar Rewards
- URL: http://arxiv.org/abs/2509.22638v1
- Date: Fri, 26 Sep 2025 17:58:27 GMT
- Title: Language Models Can Learn from Verbal Feedback Without Scalar Rewards
- Authors: Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin, Wenhu Chen, Wei Lu, Tianyu Pang
- Abstract summary: We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, we introduce the feedback-conditional policy.
- Score: 88.82702433508393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of its richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.
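A minimal sketch of the offline stage and of conditioning at inference time, assuming a HuggingFace-style causal LM; the `[FEEDBACK]`/`[PROMPT]` tags and the tiny `gpt2` stand-in are illustrative choices, not the authors' actual format (the linked repository has the real implementation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

def fcp_step(prompt: str, feedback: str, response: str) -> float:
    """One MLE step on p(response | feedback, prompt): the verbal feedback
    is prepended as plain-text conditioning, and the loss is computed only
    on the response tokens."""
    prefix = f"[FEEDBACK] {feedback}\n[PROMPT] {prompt}\n[RESPONSE] "
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    resp_ids = tok(response + tok.eos_token, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, resp_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100   # ignore conditioning tokens
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward(); opt.step(); opt.zero_grad()
    return loss.item()

# Offline stage: maximum likelihood on logged (response, feedback) pairs.
fcp_step("Write a haiku about rain.",
         "Vivid imagery, and the 5-7-5 form is respected.",
         "Soft rain on tin roofs / gutters hum a gray chorus / the garden drinks light")

# Inference / online bootstrapping: generate under a positive condition,
# then collect fresh feedback on the samples and train again.
cond = "[FEEDBACK] Excellent: correct, clear, and concise.\n[PROMPT] Write a haiku about rain.\n[RESPONSE] "
out = model.generate(**tok(cond, return_tensors="pt"), max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```

Conditioning through plain text keeps p(y | x, f) in the same token space as ordinary supervised fine-tuning, which is what lets the policy be steered at inference simply by asserting the feedback it should earn.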
Related papers
- Expanding the Capabilities of Reinforcement Learning via Text Feedback [49.561885700139676]
We formalize a multi-turn RL setup, RL from Text Feedback (RLTF), where text feedback is available during training but not at inference. To exploit training-time feedback, we propose two methods: Self-Distillation (RLTF-SD), which trains the single-turn policy to match its own feedback-conditioned second-turn generations, and Feedback Modeling (RLTF-FM), which predicts the feedback as an auxiliary objective. Our results show that both methods consistently outperform strong baselines across benchmarks.
arXiv Detail & Related papers (2026-02-02T18:56:56Z)
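The self-distillation variant lends itself to a short sketch: generate a feedback-conditioned second-turn revision, then train the single-turn policy to reproduce that revision directly from the prompt. The prompt format and the `generate`, `get_feedback`, and `sft_step` helpers below are assumptions, not the paper's code:

```python
def rltf_sd_round(policy, prompts, get_feedback, sft_step):
    """One round of RLTF-SD-style self-distillation (hypothetical interfaces)."""
    for x in prompts:
        y1 = policy.generate(x)                      # single-turn draft
        f = get_feedback(x, y1)                      # training-time text feedback
        ctx = f"{x}\nDraft: {y1}\nFeedback: {f}\nRevised: "
        y2 = policy.generate(ctx)                    # feedback-conditioned second turn
        sft_step(policy, prompt=x, target=y2)        # distill into the single-turn policy
```

- Text2Grad: Reinforcement Learning from Natural Language Feedback [32.59003667154527]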
We introduce Text2Grad, a fine-grained reinforcement learning paradigm that turns free-form textual feedback into span-level gradients. Our results demonstrate that natural-language feedback, when converted to gradients, is a powerful signal for fine-grained policy optimization.
arXiv Detail & Related papers (2025-05-28T13:23:49Z)
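As a rough illustration of span-level credit assignment, the loss below weights per-token log-probabilities by per-token rewards derived from critique spans; how Text2Grad actually extracts `span_rewards` from free-form feedback is the paper's contribution and is assumed given here:

```python
import torch

def span_weighted_loss(token_logps: torch.Tensor,
                       span_rewards: torch.Tensor) -> torch.Tensor:
    """token_logps: (T,) log-probs of the generated tokens under the policy.
    span_rewards: (T,) per-token rewards, e.g. +1 inside praised spans,
    -1 inside criticized spans, 0 elsewhere. Minimizing this loss raises
    the likelihood of rewarded spans and lowers that of criticized ones."""
    return -(span_rewards.detach() * token_logps).sum()
```

- Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition [57.732148933412425]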
We propose a large language model-based reward decomposition framework for aligning dialogue agents. We leverage the reasoning capabilities of a frozen, pretrained large language model to infer fine-grained local implicit rewards. We evaluate both text-only and multimodal variants against state-of-the-art reward decomposition methods.
arXiv Detail & Related papers (2025-05-21T18:19:45Z)
- Zero-Shot LLMs in Human-in-the-Loop RL: Replacing Human Feedback for Reward Shaping [2.427844597259453]
Reinforcement learning (RL) often struggles with reward misalignment. Human-in-the-loop (HITL) methods can mitigate this issue, but they also introduce biases. We propose two key contributions to address these challenges.
arXiv Detail & Related papers (2025-03-26T03:17:12Z)
- Time-Reversal Provides Unsupervised Feedback to LLMs [31.575024356581846]
Time Reversed Language Models (TRLMs) can score and generate queries when conditioned on responses. We show that TRLM scoring outperforms conventional forward scoring of response given query.
arXiv Detail & Related papers (2024-12-03T17:54:12Z)
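The scoring direction is simple to sketch. A true TRLM is pretrained in reverse token order; the proxy below merely reorders the context for an ordinary forward LM to estimate log p(query | response), purely to show how response-conditioned scoring can rerank candidates:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def reverse_score(query: str, response: str) -> float:
    """Sum of log-probs of the query tokens given the response as context."""
    ctx_ids = tok(f"Response: {response}\nQuery: ", return_tensors="pt").input_ids
    q_ids = tok(query, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, q_ids], dim=1)
    logits = model(ids).logits[:, :-1]                    # next-token predictions
    logps = F.log_softmax(logits, dim=-1)
    targets = ids[:, 1:]
    tok_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return tok_logps[:, ctx_ids.shape[1] - 1:].sum().item()  # query positions only

# Rerank candidate responses by how well each one "explains" the query.
best = max(["candidate A", "candidate B"],
           key=lambda r: reverse_score("the query", r))
```

- Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback [8.601283886845664]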
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) with human intentions and values.
Despite its effectiveness and popularity, RLHF is prone to biased local optimization.
We propose a novel sequence-to-sequence (seq2seq) reward modeling method.
arXiv Detail & Related papers (2024-08-30T16:14:35Z)
- RLVF: Learning from Verbal Feedback without Overgeneralization [94.19501420241188]
We study the problem of incorporating verbal feedback without such overgeneralization.
We develop a new method, Contextualized Critiques with Constrained Preference Optimization (C3PO).
Our approach effectively applies verbal feedback to relevant scenarios while preserving existing behaviors for other contexts.
arXiv Detail & Related papers (2024-02-16T18:50:24Z)
- Direct Language Model Alignment from Online AI Feedback [78.40436231613754]
Direct alignment from preferences (DAP) methods have recently emerged as efficient alternatives to reinforcement learning from human feedback (RLHF).
In this study, we posit that online feedback is key and improves DAP methods.
Our method, online AI feedback (OAIF), uses an LLM as annotator: in each training iteration, we sample two responses from the current model and prompt the LLM annotator to choose which one is preferred, thus providing online feedback.
arXiv Detail & Related papers (2024-02-07T12:31:13Z)
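One OAIF step reduces to a few lines when paired with a standard DPO update; `policy.sample`, `policy.logp`, and `annotate` below are hypothetical interfaces standing in for sampling, sequence log-probabilities, and the LLM judge:

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Pairwise DPO objective on winner/loser sequence log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin)

def oaif_step(policy, ref, prompt, annotate):
    """Sample two candidates, let an LLM judge pick, apply a DAP update."""
    y1, y2 = policy.sample(prompt), policy.sample(prompt)
    winner, loser = annotate(prompt, y1, y2)        # online AI feedback
    loss = dpo_loss(policy.logp(prompt, winner), policy.logp(prompt, loser),
                    ref.logp(prompt, winner), ref.logp(prompt, loser))
    loss.backward()
    return loss
```

- LiPO: Listwise Preference Optimization through Learning-to-Rank [62.02782819559389]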
A policy can learn more effectively from a ranked list of plausible responses given the prompt. We show that LiPO-λ can outperform DPO variants and SLiC by a clear margin on several preference alignment tasks.
arXiv Detail & Related papers (2024-02-02T20:08:10Z)
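For intuition, the Plackett-Luce listwise likelihood below is the plain special case of the objectives LiPO studies; LiPO-λ adds Lambda-style rank weighting that this sketch omits:

```python
import torch

def listwise_loss(policy_logps: torch.Tensor, ranks: torch.Tensor) -> torch.Tensor:
    """policy_logps: (K,) sequence log-probs for K responses to one prompt.
    ranks: (K,) indices ordering those responses from best to worst.
    Returns the negative Plackett-Luce log-likelihood of the ranked order."""
    s = policy_logps[ranks]                                   # best-first scores
    tail = torch.stack([torch.logsumexp(s[i:], 0) for i in range(len(s))])
    return -(s - tail).sum()                                  # sum of log-softmaxes over tails

# Example: response 1 ranked best, then 2, then 0.
loss = listwise_loss(torch.tensor([-3.2, -1.1, -2.5]), torch.tensor([1, 2, 0]))
```

- Improving Code Generation by Training with Natural Language Feedback [69.52985513422381]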
We formalize an algorithm for learning from natural language feedback at training time instead, which we call Imitation learning from Language Feedback (ILF).
ILF requires only a small amount of human-written feedback during training and does not require the same feedback at test time, making it both user-friendly and sample-efficient.
We use ILF to improve a CodeGen-Mono 6.1B model's pass@1 rate by 38% relative (and 10% absolute) on the Mostly Basic Python Problems (MBPP) benchmark.
arXiv Detail & Related papers (2023-03-28T16:15:31Z)
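The ILF loop compresses to a few lines; `get_feedback`, `run_tests`, and `finetune` are hypothetical placeholders for human annotation, unit-test filtering, and supervised fine-tuning, not the paper's API:

```python
def ilf_round(model, tasks, get_feedback, run_tests, finetune):
    """One ILF round: draft -> human feedback -> refinement -> filter -> SFT."""
    keep = []
    for task in tasks:
        draft = model.generate(task.prompt)
        note = get_feedback(task.prompt, draft)            # brief human-written critique
        refine_prompt = (f"{task.prompt}\n# Draft:\n{draft}\n"
                         f"# Feedback: {note}\n# Refined solution:\n")
        refined = model.generate(refine_prompt)
        if run_tests(task, refined):                       # keep only passing refinements
            keep.append((task.prompt, refined))
    finetune(model, keep)                                  # imitate the good refinements
```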
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.