MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop
- URL: http://arxiv.org/abs/2601.22900v1
- Date: Fri, 30 Jan 2026 12:19:54 GMT
- Title: MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop
- Authors: Xuancheng Li, Haitao Li, Yujia Zhou, Yiqun Liu, Qingyao Ai
- Abstract summary: We investigate how to leverage richer verbal feedback to guide RLVR training on failed samples. Specifically, we propose a multi-turn feedback-guided reinforcement learning framework. It builds on three mechanisms: (1) dynamic multi-turn regeneration guided by feedback, triggered only on failed samples, (2) two complementary learning signals for within-turn and cross-turn optimization, and (3) structured feedback injection into the model's reasoning process.
- Score: 28.558050861419957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning in multiple domains, yet outcome-only scalar rewards are often sparse and uninformative, especially on failed samples, where they merely indicate failure and provide no insight into why the reasoning fails. In this paper, we investigate how to leverage richer verbal feedback to guide RLVR training on failed samples, and how to convert such feedback into a trainable learning signal. Specifically, we propose a multi-turn feedback-guided reinforcement learning framework. It builds on three mechanisms: (1) dynamic multi-turn regeneration guided by feedback, triggered only on failed samples, (2) two complementary learning signals for within-turn and cross-turn optimization, and (3) structured feedback injection into the model's reasoning process. Trained on sampled OpenR1-Math, the approach outperforms supervised fine-tuning and RLVR baselines in-domain and generalizes well out-of-domain.
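The regeneration loop described in the abstract (regenerate only on failed samples, with verbal feedback injected back into the model's reasoning context) can be illustrated with a minimal sketch. This is not the authors' implementation: `generate`, `verify`, and `critique` below are hypothetical stand-ins for the policy model, the verifiable-reward checker, and the verbal-feedback source.

```python
# Minimal sketch of a multi-turn, feedback-guided regeneration loop in the
# spirit of MulFeRL. All components below are placeholder assumptions; a real
# system would call a policy LLM, a verifiable-reward checker, and a feedback
# model instead of these stubs.
from dataclasses import dataclass

@dataclass
class Turn:
    response: str
    reward: float           # scalar verifiable reward (1.0 = pass, 0.0 = fail)
    feedback: str | None    # verbal feedback, only produced on failure

def generate(question: str, feedback_history: list[str]) -> str:
    """Stand-in for the policy model; prior feedback is injected into its context."""
    context = question + "".join(f"\n[feedback] {f}" for f in feedback_history)
    return f"<answer to: {context[:40]}...>"

def verify(question: str, response: str) -> float:
    """Stand-in for the verifiable-reward checker (e.g., exact match on math)."""
    return 0.0  # pretend every attempt fails, to exercise the loop

def critique(question: str, response: str) -> str:
    """Stand-in for the verbal-feedback source explaining *why* the attempt failed."""
    return "The final step drops a factor of 2; re-check the substitution."

def rollout(question: str, max_turns: int = 3) -> list[Turn]:
    """Regenerate only while the sample keeps failing, up to max_turns."""
    turns: list[Turn] = []
    feedback_history: list[str] = []
    for _ in range(max_turns):
        response = generate(question, feedback_history)
        reward = verify(question, response)
        if reward > 0:                      # success: stop, no feedback needed
            turns.append(Turn(response, reward, None))
            break
        feedback = critique(question, response)
        feedback_history.append(feedback)   # structured feedback injection
        turns.append(Turn(response, reward, feedback))
    return turns

if __name__ == "__main__":
    for i, turn in enumerate(rollout("Solve: 2x + 3 = 11"), start=1):
        print(f"turn {i}: reward={turn.reward}, feedback={turn.feedback}")
```

The collected turns would then feed the two complementary learning signals mentioned in the abstract (within-turn and cross-turn optimization); that training step is outside the scope of this sketch.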
Related papers
- Reinforcement Learning via Self-Distillation [37.078107691613155]
Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO). SDPO converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model.
arXiv Detail & Related papers (2026-01-28T17:45:12Z) - R^3: Replay, Reflection, and Ranking Rewards for LLM Reinforcement Learning [32.16683059021539]
Large reasoning models (LRMs) aim to solve diverse and complex problems through structured reasoning. Recent advances in group-based policy optimization methods have shown promise in enabling stable advantage estimation without reliance on process-level annotations. We propose a reinforcement learning mechanism named R^3 that operates along three directions: (1) a cross-context Replay strategy that maintains the intra-group advantage, (2) an in-context self-Reflection mechanism
arXiv Detail & Related papers (2026-01-27T13:55:34Z) - From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation [52.62655622099456]
We propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., a reward chain). In this way, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts, and style, which evaluates adherence to stylistic properties.
arXiv Detail & Related papers (2026-01-26T14:39:58Z) - Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models [61.78513830395669]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). As models train longer and scale larger, more training prompts become residual prompts: those with zero-variance rewards that provide no training signal. We propose the Explore Residual Prompts in Policy Optimization framework, which encourages exploration on residual prompts and reactivates their training signals.
arXiv Detail & Related papers (2025-11-06T20:40:27Z) - Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning [50.20267980386502]
We learn a dense, token-level reward model for process supervision directly from expert demonstrations. The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets.
arXiv Detail & Related papers (2025-10-02T09:55:26Z) - ConfClip: Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs [32.13266235550995]
Reinforcement learning (RL) has become a standard paradigm for refining large language models (LLMs). Inspired by observations from human learning, we introduce an RL technique that integrates verifiable outcomes with the model's own confidence estimates.
arXiv Detail & Related papers (2025-09-22T13:00:35Z) - ReaLM: Reflection-Enhanced Autonomous Reasoning with Small Language Models [76.28894983518164]
Small Language Models (SLMs) are a cost-effective alternative to Large Language Models (LLMs). They often struggle with complex reasoning due to their limited capacity and a tendency to produce mistakes or inconsistent answers. We introduce ReaLM, a reinforcement learning framework for robust and self-sufficient reasoning in vertical domains.
arXiv Detail & Related papers (2025-08-17T14:50:23Z) - A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning [58.80217284841095]
Multi-turn problem solving, in which Large Reasoning Models (LRMs) reflect on their reasoning and revise from feedback, is critical yet challenging. Existing Reinforcement Learning (RL) methods train large reasoning models on a single-turn paradigm with verifiable rewards. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving.
arXiv Detail & Related papers (2025-07-18T18:07:38Z) - Learning Robust Recommender from Noisy Implicit Feedback [140.7090392887355]
We propose a new training strategy named Adaptive Denoising Training (ADT).
ADT adaptively prunes the noisy interactions by two paradigms (i.e., Truncated Loss and Reweighted Loss).
We consider extra feedback (e.g., rating) as an auxiliary signal and propose three strategies to incorporate extra feedback into ADT (a minimal sketch of the two denoising losses follows this entry).
arXiv Detail & Related papers (2021-12-02T12:12:02Z)
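As a rough illustration of the two denoising paradigms named in the ADT entry above, the sketch below shows one plausible reading: interactions whose loss is high are either dropped (Truncated Loss) or down-weighted (Reweighted Loss) during training. The threshold and the exponential weighting scheme are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of loss-based denoising for implicit feedback, in the spirit
# of Truncated Loss and Reweighted Loss. The threshold and weighting function
# below are assumptions chosen for illustration only.
import math

def bce(pred: float, label: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one (user, item) interaction."""
    pred = min(max(pred, eps), 1 - eps)
    return -(label * math.log(pred) + (1 - label) * math.log(1 - pred))

def truncated_loss(preds, labels, threshold: float) -> float:
    """Drop interactions whose loss exceeds the threshold (treated as noisy)."""
    losses = [bce(p, y) for p, y in zip(preds, labels)]
    kept = [l for l in losses if l <= threshold]
    return sum(kept) / max(len(kept), 1)

def reweighted_loss(preds, labels, alpha: float = 1.0) -> float:
    """Down-weight high-loss (likely noisy) interactions instead of dropping them."""
    total, weight_sum = 0.0, 0.0
    for p, y in zip(preds, labels):
        l = bce(p, y)
        w = math.exp(-alpha * l)   # assumed weighting: larger loss -> smaller weight
        total += w * l
        weight_sum += w
    return total / max(weight_sum, 1e-12)

if __name__ == "__main__":
    preds = [0.9, 0.8, 0.1]   # model scores for observed interactions
    labels = [1.0, 1.0, 1.0]  # implicit feedback: all observed clicks
    print(truncated_loss(preds, labels, threshold=1.0))
    print(reweighted_loss(preds, labels))
```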