From Correction to Mastery: Reinforced Distillation of Large Language Model Agents
- URL: http://arxiv.org/abs/2509.14257v2
- Date: Thu, 09 Oct 2025 04:22:47 GMT
- Title: From Correction to Mastery: Reinforced Distillation of Large Language Model Agents
- Authors: Yuanjie Lyu, Chengyu Wang, Jun Huang, Tong Xu
- Abstract summary: Large Language Model agents excel at solving complex tasks through iterative reasoning and tool use. Existing distillation approaches train smaller students to imitate full teacher trajectories. We propose SCoRe, a student-centered framework in which the student generates training trajectories and the teacher corrects only the earliest error.
- Score: 13.982204994247718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Model agents excel at solving complex tasks through iterative reasoning and tool use, but typically depend on ultra-large, costly backbones. Existing distillation approaches train smaller students to imitate full teacher trajectories, yet reasoning and knowledge gaps between the teacher and student can cause compounding errors. We propose SCoRe, a student-centered framework in which the student generates training trajectories and the teacher corrects only the earliest error, producing training data matched to the student's ability and exposing specific weaknesses. The student is first fine-tuned on corrected trajectories. Subsequently, short-horizon reinforcement learning starts from the verified prefix preceding the earliest error, with target rewards assigned at that step. This design encourages autonomous problem-solving beyond imitation and enhances training stability. On 12 challenging benchmarks, a 7B-parameter student distilled with SCoRe matches the agentic performance of a 72B-parameter teacher.
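The correction procedure described in the abstract (keep the verified prefix, splice in a teacher fix at the earliest error, discard the student's later steps) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `verify` and `teacher_fix` are hypothetical stand-ins for the step verifier and the teacher model, and trajectory steps are reduced to toy values.

```python
# Sketch of SCoRe-style earliest-error correction (hypothetical interfaces).

def earliest_error(trajectory, verify):
    """Return the index of the first step failing verification, or None."""
    for i, step in enumerate(trajectory):
        if not verify(step):
            return i
    return None

def build_training_example(trajectory, verify, teacher_fix):
    """Keep the verified prefix and splice in a teacher correction at the
    earliest error; the student's subsequent steps are discarded."""
    i = earliest_error(trajectory, verify)
    if i is None:
        return trajectory, None          # fully correct: usable as-is
    prefix = trajectory[:i]              # verified prefix (RL restarts here)
    corrected = prefix + [teacher_fix(trajectory[i])]
    return corrected, i                  # i marks where the step reward attaches

# Toy usage: steps are ints, "valid" means even, the teacher fixes parity.
traj = [2, 4, 7, 8]
corrected, err_idx = build_training_example(
    traj, verify=lambda s: s % 2 == 0, teacher_fix=lambda s: s + 1)
```

The returned index is what makes the short-horizon RL stage well-posed: rollouts start from the verified prefix rather than from scratch, and the reward is assigned at the corrected step.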
Related papers
- Long-Chain Reasoning Distillation via Adaptive Prefix Alignment [57.130176131042965]
We propose a framework that exploits teacher CoTs for distillation through adaptive prefix alignment. P-ALIGN adaptively truncates teacher-generated reasoning trajectories by determining whether the remaining suffix is concise. Experiments on multiple mathematical reasoning benchmarks demonstrate that P-ALIGN outperforms all baselines by over 3%.
arXiv Detail & Related papers (2026-01-15T04:40:45Z)
- Masking Teacher and Reinforcing Student for Distilling Vision-Language Models [50.619420197124356]
Large-scale vision-language models (VLMs) have recently achieved remarkable multimodal understanding. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teachers. We propose Masters (Masking Teacher and Reinforcing Student), a mask-progressive reinforcement learning framework.
arXiv Detail & Related papers (2025-12-23T14:40:38Z)
- Personalized Distractor Generation via MCTS-Guided Reasoning Reconstruction [33.217474795590576]
Distractors, incorrect but plausible answer choices in multiple-choice questions (MCQs), play a critical role in educational assessment by diagnosing student misconceptions. Recent work has leveraged large language models (LLMs) to generate shared, group-level distractors. We introduce the task of personalized distractor generation, which aims to generate tailored distractors based on individual misconceptions inferred from each student's past question-answering (QA) records.
arXiv Detail & Related papers (2025-08-15T03:20:37Z)
- SuperCorrect: Advancing Small LLM Reasoning with Thought Template Distillation and Self-Correction [89.56181323849512]
SuperCorrect is a novel two-stage framework that uses a large teacher model to supervise and correct both the reasoning and reflection processes of a smaller student model. In the first stage, we extract hierarchical high-level and detailed thought templates from the teacher model to guide the student model in eliciting more fine-grained reasoning thoughts. In the second stage, we introduce cross-model collaborative direct preference optimization (DPO) to enhance the self-correction abilities of the student model.
arXiv Detail & Related papers (2024-10-11T17:25:52Z)
- AI-Driven Virtual Teacher for Enhanced Educational Efficiency: Leveraging Large Pretrain Models for Autonomous Error Analysis and Correction [21.159378560503036]
This paper introduces an innovative Virtual AI Teacher system designed to autonomously analyze and correct student Errors (VATE). The system has been deployed on the Squirrel AI learning platform for elementary mathematics education, where it achieves 78.3% accuracy in error analysis.
arXiv Detail & Related papers (2024-09-14T10:27:36Z)
- Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors [78.53699244846285]
Large language models (LLMs) present an opportunity to scale high-quality personalized education to all.
However, LLMs struggle to precisely detect students' errors and tailor their feedback to those errors.
Inspired by real-world teaching practice where teachers identify student errors and customize their response based on them, we focus on verifying student solutions.
arXiv Detail & Related papers (2024-07-12T10:11:40Z)
- YODA: Teacher-Student Progressive Learning for Language Models [82.0172215948963]
This paper introduces YODA, a teacher-student progressive learning framework.
It emulates the teacher-student education process to improve the efficacy of model fine-tuning.
Experiments show that training LLaMA2 with data from YODA improves SFT with significant performance gain.
arXiv Detail & Related papers (2024-01-28T14:32:15Z)
- Distantly-Supervised Named Entity Recognition with Adaptive Teacher Learning and Fine-grained Student Ensemble [56.705249154629264]
Self-training teacher-student frameworks are proposed to improve the robustness of NER models.
In this paper, we propose an adaptive teacher learning comprised of two teacher-student networks.
Fine-grained student ensemble updates each fragment of the teacher model with a temporal moving average of the corresponding fragment of the student, which enhances consistent predictions on each model fragment against noise.
arXiv Detail & Related papers (2022-12-13T12:14:09Z)