Multi-turn Training with Basic Human Feedback Helps Little on LLM Reasoning
- URL: http://arxiv.org/abs/2510.21339v2
- Date: Mon, 27 Oct 2025 11:23:40 GMT
- Title: Multi-turn Training with Basic Human Feedback Helps Little on LLM Reasoning
- Authors: Qiang Liu, Wuganjing Song, Zhenzhou Lin, Feifan Chen, Qiaolong Cai, Chen Li, Yongduo Sui,
- Abstract summary: We study whether multi-turn training with human feedback is necessary for reasoning tasks. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations.
- Score: 11.361171211215597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The reasoning capabilities of Large Language Models (LLMs) are typically developed through single-turn reinforcement learning, whereas real-world applications often involve multi-turn interactions with human feedback, leading to a potential mismatch between training and deployment conditions. In this work, we study whether multi-turn training with human feedback is necessary for reasoning tasks. We compare conventional single-turn training with three multi-turn strategies and reach conclusions contrary to previous research. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations, while models trained with multi-turn strategies exhibit a significant degradation in single-turn reasoning performance. These results suggest that for tasks with complete information, robust single-turn training remains more effective and reliable, as multi-turn training with basic feedback provides limited benefits and can even degrade reasoning capabilities.
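The comparison hinges on how rollouts are collected before the RL update. Below is a minimal sketch of the two regimes, assuming a hypothetical `generate` sampler and `is_correct` verifiable-reward checker; it is an illustration of the setup, not the authors' implementation.

```python
# Sketch: single-turn vs. multi-turn rollout collection for RL training.
# `generate` and `is_correct` are hypothetical stand-ins for a policy
# model's sampler and a verifiable-reward checker.

def single_turn_rollout(model, problem, generate, is_correct):
    """One attempt per problem; reward comes from a verifiable checker."""
    answer = generate(model, problem)
    return [(problem, answer, 1.0 if is_correct(problem, answer) else 0.0)]

def multi_turn_rollout(model, problem, generate, is_correct, max_turns=3):
    """Re-prompt with basic feedback until correct or out of turns."""
    transcript, samples = problem, []
    for _ in range(max_turns):
        answer = generate(model, transcript)
        reward = 1.0 if is_correct(problem, answer) else 0.0
        samples.append((transcript, answer, reward))
        if reward == 1.0:
            break
        # Basic human feedback carries no new task information.
        transcript += f"\n{answer}\nFeedback: incorrect, please try again.\n"
    return samples
```

The paper's finding is that optimizing against the second kind of rollout can erode first-turn accuracy, since the policy can learn to lean on the retry signal.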
Related papers
- Prepare Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning [49.99694105650486]
Self-Debate Reinforcement Learning (SDRL) is a training framework that equips a single large language model with strong problem-solving ability. We show that SDRL improves overall Multi-Agent Debate (MAD) performance while simultaneously strengthening single-model reasoning.
arXiv Detail & Related papers (2026-01-29T20:21:44Z)
- VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning [69.44871115752055]
We propose an advanced multimodal reasoning model trained via a novel Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft-weighting mechanism, which dynamically adjusts training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity.
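As a rough illustration, a difficulty soft-weighting step could look like the sketch below; the Gaussian weighting form and the linear `target` schedule are assumptions for illustration, not the paper's exact formulation.

```python
import math

# Sketch of an online difficulty soft-weighting mechanism, loosely in the
# spirit of PCuRL. The Gaussian form and schedule are illustrative guesses.

def difficulty_weight(sample_difficulty, target_difficulty, bandwidth=0.2):
    """Softly up-weight samples near the current curriculum target."""
    gap = sample_difficulty - target_difficulty
    return math.exp(-(gap ** 2) / (2 * bandwidth ** 2))

def weighted_batch(batch, stage, num_stages):
    """Later RL stages shift the target toward harder samples."""
    target = (stage + 1) / num_stages  # difficulty in [0, 1]
    return [(sample, difficulty_weight(d, target)) for sample, d in batch]
```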
arXiv Detail & Related papers (2025-07-30T12:23:21Z)
- A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning [58.80217284841095]
Multi-turn problem solving, in which models reflect on their reasoning and revise it in response to feedback, is critical yet challenging for Large Reasoning Models (LRMs). Existing Reinforcement Learning (RL) methods instead train large reasoning models in a single-turn paradigm with verifiable rewards. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving.
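Concretely, unary feedback can be folded into the policy's observation as a growing transcript. The formatting below is a guess at the general shape of that idea, not UFO's actual template.

```python
def build_observation(question, failed_attempts, feedback="Try again."):
    """Compose a multi-turn observation: the question plus prior attempts,
    each followed by the same uninformative unary feedback signal."""
    parts = [question]
    for attempt in failed_attempts:
        parts += [attempt, feedback]
    return "\n".join(parts)
```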
arXiv Detail & Related papers (2025-07-18T18:07:38Z)
- Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models [79.90523648823522]
Multi-stage continual learning can lead to catastrophic forgetting. This paper evaluates three mitigation strategies: model merging, discounting the LoRA scaling factor, and experience replay. Results show that experience replay is the most effective, with further gains achieved by combining it with other methods.
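For reference, experience replay in this continual-learning sense means mixing retained earlier-stage data into each new stage's batches. The fixed `replay_ratio` below is an assumed, generic formulation rather than the paper's configuration.

```python
import random

# Sketch of experience replay for multi-stage continual learning:
# each new-stage batch mixes in samples retained from earlier stages.

def replay_batch(new_data, replay_buffer, batch_size=32, replay_ratio=0.25):
    """Draw a mixed batch: mostly new-stage data, partly old-stage data."""
    n_replay = min(int(batch_size * replay_ratio), len(replay_buffer))
    batch = random.sample(new_data, batch_size - n_replay)
    batch += random.sample(replay_buffer, n_replay)
    random.shuffle(batch)
    return batch
```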
arXiv Detail & Related papers (2025-05-23T05:50:14Z)
- Revisiting the Relationship between Adversarial and Clean Training: Why Clean Training Can Make Adversarial Training Better [1.1970409518725493]
Adversarial training (AT) is an effective technique for enhancing adversarial robustness, but it comes at the cost of a decline in generalization ability. Recent studies have attempted to use clean training to assist adversarial training, yet their conclusions contradict one another. We propose a new idea for leveraging clean training to further improve the performance of advanced AT methods.
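For context, a common way to combine the two signals is a weighted clean-plus-adversarial loss. The FGSM perturbation and fixed mixing weight in this sketch are standard textbook choices, not this paper's proposed method.

```python
import torch
import torch.nn.functional as F

def mixed_step(model, optimizer, x, y, eps=8 / 255, clean_weight=0.5):
    """One update mixing clean and FGSM-adversarial cross-entropy losses."""
    # Craft an FGSM adversarial example against the current model.
    x_adv = x.clone().detach().requires_grad_(True)
    grad, = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)
    x_adv = (x + eps * grad.sign()).clamp(0, 1).detach()

    optimizer.zero_grad()
    loss = (clean_weight * F.cross_entropy(model(x), y)
            + (1 - clean_weight) * F.cross_entropy(model(x_adv), y))
    loss.backward()
    optimizer.step()
    return loss.item()
```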
arXiv Detail & Related papers (2025-03-30T15:58:41Z)
- Diving into Self-Evolving Training for Multimodal Reasoning [36.70979791148913]
Self-evolving training has emerged as a key approach for complex reasoning tasks. This paper reframes self-evolving training for multimodal reasoning through the lens of reinforcement learning. We propose M-STAR, a framework that achieves consistent performance gains across models of varying sizes and diverse benchmarks.
arXiv Detail & Related papers (2024-12-23T10:18:41Z)
- How to Train Your Multi-Exit Model? Analyzing the Impact of Training Strategies [3.1836117900874825]
Early exits enable the network's forward pass to terminate early by attaching trainable internal classifiers to the backbone network. Existing early-exit methods typically adopt either a joint training approach, where the backbone and exit heads are trained simultaneously, or a disjoint approach, where the heads are trained separately. This paper introduces a set of metrics to analyze early-exit training dynamics and guide the choice of training strategy.
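A bare-bones early-exit model makes the setup concrete; the layer sizes and confidence threshold here are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

# Sketch of an early-exit network: internal classifiers ("exit heads")
# attached after intermediate backbone blocks.

class EarlyExitNet(nn.Module):
    def __init__(self, dim=128, num_classes=10, num_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
             for _ in range(num_blocks)]
        )
        self.heads = nn.ModuleList(
            [nn.Linear(dim, num_classes) for _ in range(num_blocks)]
        )

    def forward(self, x, threshold=0.9):
        # For simplicity, assumes batch size 1 at inference time.
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            logits = head(x)
            # Exit as soon as this head is confident enough.
            if logits.softmax(-1).max() > threshold:
                return logits
        return logits
```

Joint training would sum cross-entropy losses over all heads in one backward pass, while disjoint training would freeze the backbone and fit each head separately.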
arXiv Detail & Related papers (2024-07-19T13:56:57Z)
- Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning [59.98430756337374]
Supervised fine-tuning enhances the problem-solving abilities of language models across various mathematical reasoning tasks.
Our work introduces a novel technique aimed at cultivating a deeper understanding of the training problems at hand.
We propose reflective augmentation, a method that embeds problem reflection into each training instance.
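In spirit, reflective augmentation appends a reflection segment to each training target; the template below is a hypothetical rendering of that idea, not the paper's exact format.

```python
def reflective_instance(problem, solution, reflection):
    """Build one augmented training instance: the model learns to produce
    the solution followed by a reflection (e.g., an alternative method or
    an abstraction of the underlying principle)."""
    target = f"{solution}\n\nReflection: {reflection}"
    return {"input": problem, "target": target}

example = reflective_instance(
    problem="Compute 17 * 24.",
    solution="17 * 24 = 17 * 25 - 17 = 425 - 17 = 408.",
    reflection="Rounding one factor and correcting afterwards often "
               "simplifies mental multiplication.",
)
```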
arXiv Detail & Related papers (2024-06-17T19:42:22Z)
- Towards Reasoning in Large Language Models via Multi-Agent Peer Review Collaboration [28.299379264080603]
Large Language Models (LLMs) have shown remarkable capabilities in general natural language processing tasks but often fall short in complex reasoning tasks.
Recent studies have explored human-like problem-solving strategies, such as self-correction, to further push the boundary of single-model reasoning ability.
We introduce a multi-agent collaboration strategy that emulates the academic peer review process.
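A schematic peer-review round might look like the following; the three-phase structure (draft, review, revise) is a plausible reading of the abstract, and `ask` is a hypothetical single-call LLM interface.

```python
# Sketch of a peer-review round among LLM agents: each agent drafts an
# answer, reviews its peers' drafts, then revises with the reviews in hand.

def peer_review_round(agents, question, ask):
    drafts = {a: ask(a, question) for a in agents}
    reviews = {
        a: [ask(r, f"Review this answer to '{question}': {drafts[a]}")
            for r in agents if r != a]
        for a in agents
    }
    revised = {
        a: ask(a, f"Question: {question}\nYour draft: {drafts[a]}\n"
                  f"Peer reviews: {reviews[a]}\nRevise your answer.")
        for a in agents
    }
    return revised
```

Here `agents` would be identifiers such as `["agent_a", "agent_b", "agent_c"]`, with `ask(agent, prompt)` routing the prompt to that agent's model.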
arXiv Detail & Related papers (2023-11-14T13:27:07Z)
- Multimodal Guidance Network for Missing-Modality Inference in Content Moderation [6.933986643759809]
We propose a novel guidance network that promotes knowledge sharing during training.
We show that our proposed framework trains single-modality models that significantly outperform traditionally trained counterparts.
arXiv Detail & Related papers (2023-09-07T02:26:55Z)
- PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
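The "relabeling experience" idea from the title can be sketched as rescoring stored transitions whenever the reward model learned from preference feedback is updated; this minimal buffer is an assumed simplification, not PEBBLE's actual algorithm.

```python
# Sketch of experience relabeling: when the reward model learned from
# human preferences is updated, stored transitions are rescored so
# off-policy learning always uses the latest reward estimates.

class RelabelingBuffer:
    def __init__(self):
        self.transitions = []  # (state, action, next_state, reward)

    def add(self, state, action, next_state, reward_model):
        self.transitions.append(
            (state, action, next_state, reward_model(state, action))
        )

    def relabel(self, reward_model):
        """Rescore every stored transition with the updated reward model."""
        self.transitions = [
            (s, a, ns, reward_model(s, a)) for s, a, ns, _ in self.transitions
        ]
```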
arXiv Detail & Related papers (2021-06-09T14:10:50Z)