DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning
- URL: http://arxiv.org/abs/2505.15734v2
- Date: Tue, 30 Sep 2025 11:47:26 GMT
- Title: DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning
- Authors: Gaurav Srivastava, Zhenyu Bi, Meng Lu, Xuan Wang,
- Abstract summary: Large language models (LLMs) have improved significantly in their reasoning through extensive training on massive datasets.<n>We propose Debate, Train, Evolve (DTE), a ground truth-free training framework that uses multi-agent debate traces to evolve a single language model.
- Score: 12.317361893756475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have improved significantly in their reasoning through extensive training on massive datasets. However, relying solely on additional data for improvement is becoming increasingly impractical, highlighting the need for models to autonomously enhance their reasoning without external supervision. In this paper, we propose Debate, Train, Evolve (DTE), a novel ground truth-free training framework that uses multi-agent debate traces to evolve a single language model. We also introduce a new prompting strategy Reflect-Critique-Refine, to improve debate quality by explicitly instructing agents to critique and refine their reasoning. Extensive evaluations on seven reasoning benchmarks with six open-weight models show that our DTE framework achieve substantial improvements, with an average accuracy gain of 8.92% on the challenging GSM-PLUS dataset. Furthermore, we observe strong cross-domain generalization, with an average accuracy gain of 5.8% on all other benchmarks, suggesting that our method captures general reasoning capabilities. Our framework code and trained models are publicly available at https://github.com/ctrl-gaurav/Debate-Train-Evolve
Related papers
- When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents [2.689316553293938]
Supervised fine-tuning (SFT) has emerged as one of the most effective ways to improve the performance of large language models (LLMs) in downstream tasks.<n>We propose a pipeline in which LLMs generate reasoning steps that guide both the invocation of tools and the final answer generation for conversational agents.
arXiv Detail & Related papers (2025-12-12T04:44:40Z) - SPELL: Self-Play Reinforcement Learning for evolving Long-Context Language Models [79.01078135582127]
SPELL enables scalable, label-free optimization for long-context reasoning.<n>We introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model's evolving capabilities.
arXiv Detail & Related papers (2025-09-28T13:08:10Z) - Scaling Reasoning can Improve Factuality in Large Language Models [7.184302333801519]
We thoroughly examine large language model (LLM) reasoning within complex open-domain question-answering (QA) scenarios.<n>To enrich reasoning traces, we introduce factual information from knowledge graphs in the form of paths into our reasoning traces.<n>Our findings indicate that, within a single run, smaller reasoning models achieve noticeable improvements in factual accuracy compared to their original instruction-tuned counterparts.
arXiv Detail & Related papers (2025-05-16T11:39:33Z) - Phi-4-reasoning Technical Report [42.508165017775]
We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks.<n>We develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning.<n>Both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model.
arXiv Detail & Related papers (2025-04-30T05:05:09Z) - R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step.<n>Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy.<n>We propose Reasoning-Driven Process Reward Modeling (R-PRM)<n>R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z) - The Surprising Effectiveness of Test-Time Training for Few-Shot Learning [59.309477460893916]
Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks.<n>We investigate the effectiveness of test-time training (TTT) as a mechanism for improving LMs' reasoning and few-shot learning capabilities.<n>Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.
arXiv Detail & Related papers (2024-11-11T18:59:45Z) - Improve Vision Language Model Chain-of-thought Reasoning [86.83335752119741]
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness.
We show that training VLM on short answers does not generalize well to reasoning tasks that require more detailed responses.
arXiv Detail & Related papers (2024-10-21T17:00:06Z) - VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment [55.7956150385255]
We investigate the efficacy of AI feedback to scale supervision for aligning vision-language models.
We introduce VLFeedback, the first large-scale vision-language feedback dataset.
We train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback.
arXiv Detail & Related papers (2024-10-12T07:56:47Z) - Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance.<n>Existing direct preference learning algorithms are originally designed for the single-turn chat task.<n>We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z) - SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z) - Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to im-proves without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z) - Self-training Language Models for Arithmetic Reasoning [0.0]
We explore the potential of improving models' reasoning capabilities without new data.
We find that models can substantially improve in both single-round (offline) and online self-training.
arXiv Detail & Related papers (2024-07-11T11:06:05Z) - Teaching Language Models to Self-Improve by Learning from Language Feedback [40.649677201161744]
We present Self-Refinement Tuning (SRT), a method that leverages model feedback for alignment.
SRT uses a base language model (e.g., Tulu2) to generate initial responses, which are critiqued and refined by a more advanced model.
SRT further optimize the model by learning from its self-generated feedback and refinements, creating a feedback loop that promotes model improvement.
arXiv Detail & Related papers (2024-06-11T11:20:05Z) - Improving Language Model Reasoning with Self-motivated Learning [60.779625789039486]
textitSelf-motivated Learning framework motivates the model itself to automatically generate rationales on existing datasets.
We train a reward model with the rank to evaluate the quality of rationales, and improve the performance of reasoning through reinforcement learning.
arXiv Detail & Related papers (2024-04-10T14:05:44Z) - GEO-Bench: Toward Foundation Models for Earth Monitoring [139.77907168809085]
We propose a benchmark comprised of six classification and six segmentation tasks.
This benchmark will be a driver of progress across a variety of Earth monitoring tasks.
arXiv Detail & Related papers (2023-06-06T16:16:05Z) - An automatically discovered chain-of-thought prompt generalizes to novel
models and datasets [4.693905948827508]
Chain-of-thought (CoT) reasoning capabilities promise to improve performance and explainability of large language models (LLMs)
We compare different reasoning strategies induced by zero-shot prompting across six recently released LLMs.
Our findings demonstrate that while some variations in effectiveness occur, gains from CoT reasoning strategies remain robust across different models and datasets.
arXiv Detail & Related papers (2023-05-04T15:07:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.