Related papers: Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO

URL: http://arxiv.org/abs/2602.17686v1
Date: Thu, 05 Feb 2026 05:27:11 GMT
Title: Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO
Authors: Bowen Yu, Maolin Wang, Sheng Zhang, Binhao Wang, Yi Wen, Jingtong Gao, Bowen Liu, Zimo Zhao, Wanyu Wang, Xiangyu Zhao
Abstract summary: Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge. Existing approaches that compress reasoning into a single step lose the interpretability that makes CoT valuable. We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition.
Score: 24.91321958525287
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge: teacher rationales are often too verbose for smaller models to faithfully reproduce. Existing approaches that compress reasoning into a single step lose the interpretability that makes CoT valuable. We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition. First, we establish structural understanding via masked shuffled reconstruction. Second, we apply Group Relative Policy Optimization (GRPO) on masked completion tasks, enabling the model to discover its own balance between accuracy and brevity. Third, we identify persistent failure cases and guide the student to internalize teacher knowledge through targeted rewriting, again optimized with GRPO. Experiments on GSM8K demonstrate that our approach enables Qwen2.5-3B-Base to achieve an 11.29 percent accuracy improvement while reducing output length by 27.4 percent, surpassing both instruction-tuned variants and prior distillation methods.
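
The abstract does not spell out the reward used in the GRPO stages, so the following is only a minimal sketch of the general idea: GRPO scores each sampled completion relative to the other completions drawn for the same prompt, normalizing rewards by the group mean and standard deviation. The reward function here (correctness minus a small per-token length penalty) and the names `reward`, `length_weight`, and `group_relative_advantages` are assumptions made for illustration, not the authors' formulation.

```python
from statistics import mean, pstdev


def reward(is_correct: bool, n_tokens: int, length_weight: float = 0.001) -> float:
    """Hypothetical reward: 1 for a correct answer, minus a small length penalty.

    The paper's actual reward is not given in the abstract; this is an assumption.
    """
    return (1.0 if is_correct else 0.0) - length_weight * n_tokens


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt: (answer correct?, output length in tokens).
group = [(True, 120), (True, 80), (False, 60), (False, 150)]
rewards = [reward(correct, n_tokens) for correct, n_tokens in group]
advantages = group_relative_advantages(rewards)

for (correct, n_tokens), adv in zip(group, advantages):
    print(f"correct={correct!s:<5} tokens={n_tokens:<4} advantage={adv:+.3f}")
```

Under this assumed reward, the shorter of the two correct completions receives the largest advantage, which is one way a policy could be nudged toward a balance between accuracy and brevity.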

Related papers

On-Policy Context Distillation for Language Models [92.82835176360864]
We propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation and system prompt distillation.
arXiv Detail & Related papers (2026-02-12T18:58:28Z)
Temper-Then-Tilt: Principled Unlearning for Generative Models through Tempering and Classifier Guidance [51.532841645285835]
We study machine unlearning in large generative models by framing the task as density ratio estimation to a target distribution. We show it can fail to faithfully unlearn with finite samples when the forget set represents a sharp, concentrated data distribution. We introduce Temper-Then-Tilt Unlearning (T3-Unlearning), which freezes the base model and applies a two-step inference procedure.
arXiv Detail & Related papers (2026-02-10T19:08:40Z)
Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models [23.128973540926552]
Endogenous Reprompting transforms the model's understanding into an explicit generative reasoning step. We show that SEER consistently outperforms state-of-the-art baselines in evaluation accuracy, reprompting efficiency, and generation quality.
arXiv Detail & Related papers (2026-01-28T06:54:36Z)
Structured Reasoning for Large Language Models [59.215789462977206]
We propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. SCR substantially improves reasoning efficiency and self-verification. Compared with existing reasoning paradigms, it reduces output token length by up to 50%.
arXiv Detail & Related papers (2026-01-12T04:04:01Z)
Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training [76.12556589212666]
We show that curriculum post-training avoids the exponential complexity bottleneck. Under outcome-only reward signals, reinforcement learning finetuning achieves high accuracy with sample complexity. We establish guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to order.
arXiv Detail & Related papers (2025-11-10T18:29:54Z)
Can an Easy-to-Hard Curriculum Make Reasoning Emerge in Small Language Models? Evidence from a Four-Stage Curriculum on GPT-2 [0.8423417997128777]
We demonstrate that a developmentally ordered curriculum markedly improves reasoning transparency and sample-efficiency in small language models. We identify challenges: final-answer success still lags a conventional run by about 30%, and our saliency probe under-detects verbal-knowledge heads in the hardest stage.
arXiv Detail & Related papers (2025-05-16T19:08:31Z)
Critique-Guided Distillation for Efficient and Robust Language Model Reasoning [4.8433206430407045]
Supervised fine-tuning with expert demonstrations often suffers from the imitation problem. We propose Critique-Guided Distillation (CGD), a multi-stage training framework that augments SFT with teacher-generated explanatory critiques and refined responses. Our analyses show that CGD consistently reduces refinement uncertainty, improves alignment between critiques and responses, and enhances sample efficiency.
arXiv Detail & Related papers (2025-05-16T18:45:59Z)
Improving In-Context Learning with Reasoning Distillation [25.377625891065236]
Language models rely on semantic priors to perform in-context learning. We propose ReDis, a reasoning distillation technique designed to improve the inductive reasoning capabilities of language models.
arXiv Detail & Related papers (2025-04-14T18:59:10Z)
Gap Preserving Distillation by Building Bidirectional Mappings with A Dynamic Teacher [43.678380057638016]
The Gap Preserving Distillation (GPD) method trains an additional dynamic teacher model from scratch along with training the student to bridge this gap. In experiments, GPD significantly outperforms existing distillation methods on top of both CNN and transformer architectures. GPD also generalizes well to scenarios without a pre-trained teacher, including training from scratch and fine-tuning, yielding large improvements of 1.80% and 0.89% on ResNet18.
arXiv Detail & Related papers (2024-10-05T12:29:51Z)
Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation. At the sequence level, we propose a sequence correction and re-generation strategy. At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function. At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
Few-shot Relational Reasoning via Connection Subgraph Pretraining [81.30830261527231]
Connection Subgraph Reasoner (CSR) can make predictions for the target few-shot task directly without the need for pre-training. Our framework can already perform competitively with existing methods on target few-shot tasks.
arXiv Detail & Related papers (2022-10-13T04:35:14Z)
Weakly Supervised Semantic Segmentation via Alternative Self-Dual Teaching [82.71578668091914]
This paper establishes a compact learning framework that embeds the classification and mask-refinement components into a unified deep model. We propose a novel alternative self-dual teaching (ASDT) mechanism to encourage high-quality knowledge interaction.
arXiv Detail & Related papers (2021-12-17T11:56:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.