Multi-turn Training with Basic Human Feedback Helps Little on LLM Reasoning
- URL: http://arxiv.org/abs/2510.21339v2
- Date: Mon, 27 Oct 2025 11:23:40 GMT
- Title: Multi-turn Training with Basic Human Feedback Helps Little on LLM Reasoning
- Authors: Qiang Liu, Wuganjing Song, Zhenzhou Lin, Feifan Chen, Qiaolong Cai, Chen Li, Yongduo Sui,
- Abstract summary: We study whether multi-turn training with human feedback is necessary for reasoning tasks. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations.
- Score: 11.361171211215597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The reasoning capabilities of Large Language Models (LLMs) are typically developed through single-turn reinforcement learning, whereas real-world applications often involve multi-turn interactions with human feedback, leading to a potential mismatch between training and deployment conditions. In this work, we study whether multi-turn training with human feedback is necessary for reasoning tasks. We compare conventional single-turn training with three multi-turn strategies and reach conclusions contrary to previous research. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations, while models trained with multi-turn strategies exhibit a significant degradation in single-turn reasoning performance. These results suggest that for tasks with complete information, robust single-turn training remains more effective and reliable, as multi-turn training with basic feedback provides limited benefits and can even degrade reasoning capabilities.
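The comparison hinges on how rollouts are collected before the RL update. Below is a minimal sketch of the two regimes, assuming a hypothetical `generate` sampler and `is_correct` verifiable-reward checker; it is an illustration of the setup, not the authors' implementation.

```python
# Sketch: single-turn vs. multi-turn rollout collection for RL training.
# `generate` and `is_correct` are hypothetical stand-ins for a policy
# model's sampler and a verifiable-reward checker.

def single_turn_rollout(model, problem, generate, is_correct):
    """One attempt per problem; reward comes from a verifiable checker."""
    answer = generate(model, problem)
    return [(problem, answer, 1.0 if is_correct(problem, answer) else 0.0)]

def multi_turn_rollout(model, problem, generate, is_correct, max_turns=3):
    """Re-prompt with basic feedback until correct or out of turns."""
    transcript, samples = problem, []
    for _ in range(max_turns):
        answer = generate(model, transcript)
        reward = 1.0 if is_correct(problem, answer) else 0.0
        samples.append((transcript, answer, reward))
        if reward == 1.0:
            break
        # Basic human feedback carries no new task information.
        transcript += f"\n{answer}\nFeedback: incorrect, please try again.\n"
    return samples
```

The paper's finding is that optimizing against the second kind of rollout can erode first-turn accuracy, since the policy can learn to lean on the retry signal.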
Related papers
- Prepare Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning [49.99694105650486]
Self-Debate Reinforcement Learning (SDRL) is a training framework that equips a single large language model with strong problem-solving ability. We show that SDRL improves overall Multi-Agent Debate (MAD) performance while simultaneously strengthening single-model reasoning.
arXiv Detail & Related papers (2026-01-29T20:21:44Z)
- VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning [69.44871115752055]
We propose an advanced multimodal reasoning model trained via a novel Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft-weighting mechanism, which dynamically adjusts training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity.
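As a rough illustration, a difficulty soft-weighting step could look like the sketch below; the Gaussian weighting form and the linear `target` schedule are assumptions for illustration, not the paper's exact formulation.

```python
import math

# Sketch of an online difficulty soft-weighting mechanism, loosely in the
# spirit of PCuRL. The Gaussian form and schedule are illustrative guesses.

def difficulty_weight(sample_difficulty, target_difficulty, bandwidth=0.2):
    """Softly up-weight samples near the current curriculum target."""
    gap = sample_difficulty - target_difficulty
    return math.exp(-(gap ** 2) / (2 * bandwidth ** 2))

def weighted_batch(batch, stage, num_stages):
    """Later RL stages shift the target toward harder samples."""
    target = (stage + 1) / num_stages  # difficulty in [0, 1]
    return [(sample, difficulty_weight(d, target)) for sample, d in batch]
```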
arXiv Detail & Related papers (2025-07-30T12:23:21Z)
- A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning [58.80217284841095]
Multi-turn problem solving, in which models reflect on their reasoning and revise it in response to feedback, is critical yet challenging for Large Reasoning Models (LRMs). Existing Reinforcement Learning (RL) methods instead train large reasoning models in a single-turn paradigm with verifiable rewards. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving.
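Concretely, unary feedback can be folded into the policy's observation as a growing transcript. The formatting below is a guess at the general shape of that idea, not UFO's actual template.

```python
def build_observation(question, failed_attempts, feedback="Try again."):
    """Compose a multi-turn observation: the question plus prior attempts,
    each followed by the same uninformative unary feedback signal."""
    parts = [question]
    for attempt in failed_attempts:
        parts += [attempt, feedback]
    return "\n".join(parts)
```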
arXiv Detail & Related papers (2025-07-18T18:07:38Z)
- Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models [79.90523648823522]
Multi-stage continual learning can lead to catastrophic forgetting. This paper evaluates three mitigation strategies: model merging, discounting the LoRA scaling factor, and experience replay. Results show that experience replay is the most effective, with further gains achieved by combining it with other methods.
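For reference, experience replay in this continual-learning sense means mixing retained earlier-stage data into each new stage's batches. The fixed `replay_ratio` below is an assumed, generic formulation rather than the paper's configuration.

```python
import random

# Sketch of experience replay for multi-stage continual learning:
# each new-stage batch mixes in samples retained from earlier stages.

def replay_batch(new_data, replay_buffer, batch_size=32, replay_ratio=0.25):
    """Draw a mixed batch: mostly new-stage data, partly old-stage data."""
    n_replay = min(int(batch_size * replay_ratio), len(replay_buffer))
    batch = random.sample(new_data, batch_size - n_replay)
    batch += random.sample(replay_buffer, n_replay)
    random.shuffle(batch)
    return batch
```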
arXiv Detail & Related papers (2025-05-23T05:50:14Z)
- Revisiting the Relationship between Adversarial and Clean Training: Why Clean Training Can Make Adversarial Training Better [1.1970409518725493]
Adversarial training (AT) is an effective technique for enhancing adversarial robustness, but it comes at the cost of a decline in generalization ability. Recent studies have attempted to use clean training to assist adversarial training, yet their conclusions contradict one another. We propose a new idea for leveraging clean training to further improve the performance of advanced AT methods.
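For context, a common way to combine the two signals is a weighted clean-plus-adversarial loss. The FGSM perturbation and fixed mixing weight in this sketch are standard textbook choices, not this paper's proposed method.

```python
import torch
import torch.nn.functional as F

def mixed_step(model, optimizer, x, y, eps=8 / 255, clean_weight=0.5):
    """One update mixing clean and FGSM-adversarial cross-entropy losses."""
    # Craft an FGSM adversarial example against the current model.
    x_adv = x.clone().detach().requires_grad_(True)
    grad, = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)
    x_adv = (x + eps * grad.sign()).clamp(0, 1).detach()

    optimizer.zero_grad()
    loss = (clean_weight * F.cross_entropy(model(x), y)
            + (1 - clean_weight) * F.cross_entropy(model(x_adv), y))
    loss.backward()
    optimizer.step()
    return loss.item()
```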
arXiv Detail & Related papers (2025-03-30T15:58:41Z)
- Diving into Self-Evolving Training for Multimodal Reasoning [36.70979791148913]
Self-evolving training has emerged as a key approach for complex reasoning tasks. This paper reframes self-evolving training for multimodal reasoning through the lens of reinforcement learning. We propose M-STAR, a framework that achieves consistent performance gains across models of varying sizes and diverse benchmarks.
arXiv Detail & Related papers (2024-12-23T10:18:41Z)
- How to Train Your Multi-Exit Model? Analyzing the Impact of Training Strategies [3.1836117900874825]
Early exits enable the network's forward pass to terminate early by attaching trainable internal classifiers to the backbone network. Existing early-exit methods typically adopt either a joint training approach, where the backbone and exit heads are trained simultaneously, or a disjoint approach, where the heads are trained separately. This paper introduces a set of metrics to analyze early-exit training dynamics and guide the choice of training strategy.
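A bare-bones early-exit model makes the setup concrete; the layer sizes and confidence threshold here are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

# Sketch of an early-exit network: internal classifiers ("exit heads")
# attached after intermediate backbone blocks.

class EarlyExitNet(nn.Module):
    def __init__(self, dim=128, num_classes=10, num_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
             for _ in range(num_blocks)]
        )
        self.heads = nn.ModuleList(
            [nn.Linear(dim, num_classes) for _ in range(num_blocks)]
        )

    def forward(self, x, threshold=0.9):
        # For simplicity, assumes batch size 1 at inference time.
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            logits = head(x)
            # Exit as soon as this head is confident enough.
            if logits.softmax(-1).max() > threshold:
                return logits
        return logits
```

Joint training would sum cross-entropy losses over all heads in one backward pass, while disjoint training would freeze the backbone and fit each head separately.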
arXiv Detail & Related papers (2024-07-19T13:56:57Z)
- Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning [59.98430756337374]
Supervised fine-tuning enhances the problem-solving abilities of language models across various mathematical reasoning tasks.
Our work introduces a novel technique aimed at cultivating a deeper understanding of the training problems at hand.
We propose reflective augmentation, a method that embeds problem reflection into each training instance.
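In spirit, reflective augmentation appends a reflection segment to each training target; the template below is a hypothetical rendering of that idea, not the paper's exact format.

```python
def reflective_instance(problem, solution, reflection):
    """Build one augmented training instance: the model learns to produce
    the solution followed by a reflection (e.g., an alternative method or
    an abstraction of the underlying principle)."""
    target = f"{solution}\n\nReflection: {reflection}"
    return {"input": problem, "target": target}

example = reflective_instance(
    problem="Compute 17 * 24.",
    solution="17 * 24 = 17 * 25 - 17 = 425 - 17 = 408.",
    reflection="Rounding one factor and correcting afterwards often "
               "simplifies mental multiplication.",
)
```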
arXiv Detail & Related papers (2024-06-17T19:42:22Z)
- Towards Reasoning in Large Language Models via Multi-Agent Peer Review Collaboration [28.299379264080603]
Large Language Models (LLMs) have shown remarkable capabilities in general natural language processing tasks but often fall short in complex reasoning tasks.
Recent studies have explored human-like problem-solving strategies, such as self-correction, to further push the boundary of single-model reasoning ability.
We introduce a multi-agent collaboration strategy that emulates the academic peer review process.
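A schematic peer-review round might look like the following; the three-phase structure (draft, review, revise) is a plausible reading of the abstract, and `ask` is a hypothetical single-call LLM interface.

```python
# Sketch of a peer-review round among LLM agents: each agent drafts an
# answer, reviews its peers' drafts, then revises with the reviews in hand.

def peer_review_round(agents, question, ask):
    drafts = {a: ask(a, question) for a in agents}
    reviews = {
        a: [ask(r, f"Review this answer to '{question}': {drafts[a]}")
            for r in agents if r != a]
        for a in agents
    }
    revised = {
        a: ask(a, f"Question: {question}\nYour draft: {drafts[a]}\n"
                  f"Peer reviews: {reviews[a]}\nRevise your answer.")
        for a in agents
    }
    return revised
```

Here `agents` would be identifiers such as `["agent_a", "agent_b", "agent_c"]`, with `ask(agent, prompt)` routing the prompt to that agent's model.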
arXiv Detail & Related papers (2023-11-14T13:27:07Z)
- Multimodal Guidance Network for Missing-Modality Inference in Content Moderation [6.933986643759809]
We propose a novel guidance network that promotes knowledge sharing during training.
We show that our proposed framework trains single-modality models that significantly outperform traditionally trained counterparts.
arXiv Detail & Related papers (2023-09-07T02:26:55Z)
- PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
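The "relabeling experience" idea from the title can be sketched as rescoring stored transitions whenever the reward model learned from preference feedback is updated; this minimal buffer is an assumed simplification, not PEBBLE's actual algorithm.

```python
# Sketch of experience relabeling: when the reward model learned from
# human preferences is updated, stored transitions are rescored so
# off-policy learning always uses the latest reward estimates.

class RelabelingBuffer:
    def __init__(self):
        self.transitions = []  # (state, action, next_state, reward)

    def add(self, state, action, next_state, reward_model):
        self.transitions.append(
            (state, action, next_state, reward_model(state, action))
        )

    def relabel(self, reward_model):
        """Rescore every stored transition with the updated reward model."""
        self.transitions = [
            (s, a, ns, reward_model(s, a)) for s, a, ns, _ in self.transitions
        ]
```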
arXiv Detail & Related papers (2021-06-09T14:10:50Z)