Following the Teacher's Footsteps: Scheduled Checkpoint Distillation for Domain-Specific LLMs
- URL: http://arxiv.org/abs/2601.10114v1
- Date: Thu, 15 Jan 2026 06:46:01 GMT
- Title: Following the Teacher's Footsteps: Scheduled Checkpoint Distillation for Domain-Specific LLMs
- Authors: Cheng Feng, Chaoliang Zhong, Jun Sun, Yusuke Oishi
- Abstract summary: Large language models (LLMs) are challenging to deploy for domain-specific tasks due to their massive scale. While distilling a fine-tuned LLM into a smaller student model is a promising alternative, the capacity gap between teacher and student often leads to suboptimal performance. We propose a novel theoretical insight: a student can outperform its teacher if its advantage on a Student-Favored Subdomain outweighs its deficit on the Teacher-Favored Subdomain.
- Score: 5.786917616876281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are challenging to deploy for domain-specific tasks due to their massive scale. While distilling a fine-tuned LLM into a smaller student model is a promising alternative, the capacity gap between teacher and student often leads to suboptimal performance. This raises a key question: when and how can a student model match or even surpass its teacher on domain-specific tasks? In this work, we propose a novel theoretical insight: a student can outperform its teacher if its advantage on a Student-Favored Subdomain (SFS) outweighs its deficit on the Teacher-Favored Subdomain (TFS). Guided by this insight, we propose Scheduled Checkpoint Distillation (SCD), which reduces the TFS deficit by emulating the teacher's convergence process during supervised fine-tuning (SFT) on the domain task, and a sample-wise Adaptive Weighting (AW) mechanism to preserve student strengths on SFS. Experiments across diverse domain tasks--including QA, NER, and text classification in multiple languages--show that our method consistently outperforms existing distillation approaches, allowing the student model to match or even exceed the performance of its fine-tuned teacher.
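To make the abstract's two components concrete, the toy sketch below pairs a schedule over teacher checkpoints saved during SFT with a per-sample weight that shrinks the distillation term wherever the student already beats the teacher. The sigmoid weighting rule, the temperature-scaled KD loss, the hyperparameters, and the toy linear models are illustrative assumptions; the paper's actual SCD and AW formulations may differ.

```python
# Toy sketch of scheduled checkpoint distillation with sample-wise adaptive
# weighting, following the abstract's description. The weighting rule, loss mix,
# and toy models are assumptions for illustration, not the authors' implementation.
import torch
import torch.nn.functional as F

def aw_weights(student_logits, teacher_logits, labels):
    """Assumed AW rule: down-weight distillation on samples where the student
    already beats the teacher (its Student-Favored Subdomain)."""
    s_loss = F.cross_entropy(student_logits, labels, reduction="none")
    t_loss = F.cross_entropy(teacher_logits, labels, reduction="none")
    # ~0 when the student is better on a sample, ~1 when the teacher is better.
    return torch.sigmoid(s_loss - t_loss).detach()

def scd_step(student, teacher_ckpt, batch, optimizer, alpha=0.5, tau=2.0):
    """One distillation step against the teacher checkpoint scheduled for this stage."""
    x, y = batch
    with torch.no_grad():
        t_logits = teacher_ckpt(x)
    s_logits = student(x)
    w = aw_weights(s_logits, t_logits, y)
    kd = F.kl_div(F.log_softmax(s_logits / tau, dim=-1),
                  F.softmax(t_logits / tau, dim=-1),
                  reduction="none").sum(-1) * tau ** 2
    loss = ((1 - alpha) * F.cross_entropy(s_logits, y, reduction="none")
            + alpha * w * kd).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# The "schedule": stand-in snapshots from the teacher's SFT run, visited in order
# (early -> converged) instead of distilling only from the final checkpoint.
torch.manual_seed(0)
teacher_ckpts = [torch.nn.Linear(16, 4) for _ in range(3)]
student = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(student.parameters(), lr=0.1)
data = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(6)]
for ckpt in teacher_ckpts:
    for batch in data:
        scd_step(student, ckpt, batch, opt)
```

The design point the abstract argues for is the schedule itself: following earlier, easier-to-match checkpoints first is meant to shrink the student's deficit on the Teacher-Favored Subdomain, while the adaptive weight protects its advantage on the Student-Favored Subdomain.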
Related papers
- DistillFSS: Synthesizing Few-Shot Knowledge into a Lightweight Segmentation Model [8.487765630753048]
Cross-Domain Few-Shot Segmentation (CD-FSS) seeks to segment unknown classes in unseen domains. We propose DistillFSS, a framework that embeds support-set knowledge directly into a model's parameters. By internalizing few-shot reasoning into a dedicated layer within the student network, DistillFSS eliminates the need for support images at test time.
arXiv Detail & Related papers (2025-12-05T10:54:23Z)
- Merge-of-Thought Distillation [23.53356244978525]
Merge-of-Thought Distillation (MoT) is a lightweight framework that alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. On competition math benchmarks, applying MoT to a Qwen3-14B student surpasses strong models including DeepSeek-R1, Qwen3-32B, and OpenAI-O1. MoT consistently outperforms the best single-teacher distillation, improves general reasoning beyond mathematics, and shows robustness to distribution-shifted and peer-level teachers.
arXiv Detail & Related papers (2025-09-10T17:46:57Z)
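The weight-space merging step that MoT alternates with teacher-specific fine-tuning can be pictured with a short sketch. The uniform parameter average and the toy linear models below are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of weight-space merging in the spirit of the MoT entry above:
# several student variants, each fine-tuned under a different teacher, are
# collapsed into one model by averaging their parameters (an assumed uniform
# average; the paper's merge rule may differ).
import copy
import torch

def merge_students(students):
    """Average the parameters of several same-architecture student variants."""
    merged = copy.deepcopy(students[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack([dict(s.named_parameters())[name] for s in students])
            param.copy_(stacked.mean(dim=0))
    return merged

# Toy usage: three "teacher-specific" student variants merged into one; MoT would
# alternate such merges with further rounds of teacher-specific fine-tuning.
variants = [torch.nn.Linear(16, 4) for _ in range(3)]
merged_student = merge_students(variants)
```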
- Enhancing Long-Chain Reasoning Distillation through Error-Aware Self-Reflection [64.73809794561305]
errOr-aware self-ReflectION (ORION) is a framework that refines teacher CoTs through an Error-Aware Reflection process. Experiments on multiple mathematical reasoning benchmarks demonstrate that ORION consistently improves performance by more than 2% over all baselines.
arXiv Detail & Related papers (2025-05-28T08:57:03Z)
- Knowledge Distillation of Domain-adapted LLMs for Question-Answering in Telecom [0.6897286554827872]
Knowledge Distillation (KD) is one approach to reducing the size of Large Language Models (LLMs). For domain-specific tasks, it is not clear whether the teacher model, the student model, or both must be considered for domain adaptation. We design experiments to study the impact of vocabulary (same and different) and KD algorithms (vanilla KD and Dual Space KD, DSKD) on the distilled model.
arXiv Detail & Related papers (2025-04-28T17:19:25Z)
- Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced that filters and weights teacher representations so the student distills only from task-relevant representations. Experimental results on real-world affective computing datasets, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z)
- PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs [47.35598271306371]
Large Language Models (LLMs) have exhibited impressive capabilities in various tasks, yet their vast parameter sizes restrict their applicability in resource-constrained settings.
Knowledge distillation (KD) offers a viable solution by transferring expertise from large teacher models to compact student models.
We present PLaD, a novel preference-based LLM distillation framework.
arXiv Detail & Related papers (2024-06-05T03:08:25Z)
- ERNIE 3.0 Tiny: Frustratingly Simple Method to Improve Task-Agnostic Distillation Generalization [36.338614215561805]
Task-agnostic knowledge distillation attempts to address the problem of deploying large pretrained language models in resource-constrained scenarios.
We show that we can leverage multi-task learning in task-agnostic distillation to advance the generalization of the resulting student.
arXiv Detail & Related papers (2023-01-09T15:12:50Z)
- Faculty Distillation with Optimal Transport [53.69235109551099]
We propose to link the teacher's task and the student's task by optimal transport.
Based on the semantic relationship between their label spaces, we can bridge the support gap between output distributions.
Experiments under various settings demonstrate the succinctness and versatility of our method.
arXiv Detail & Related papers (2022-04-25T09:34:37Z)
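To make the label-space bridging idea in the Faculty Distillation entry above concrete, the sketch below computes an entropic optimal-transport plan between two toy label spaces and uses it to carry a teacher output distribution into the student's label space. The cost matrix, the Sinkhorn solver, and the toy label embeddings are assumptions for illustration, not the paper's formulation.

```python
# Rough sketch: an optimal-transport plan over the label spaces carries the
# teacher's output distribution into the student's label space. Cost matrix,
# solver, and label embeddings below are made up for illustration.
import torch

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropic-OT plan between marginals a (teacher labels) and b (student labels)."""
    K = torch.exp(-cost / eps)
    u = torch.ones_like(a)
    for _ in range(iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Toy example: 5 teacher classes, 3 student classes, cost from stand-in label
# embeddings representing the semantic relationship between label spaces.
teacher_emb = torch.randn(5, 8)
student_emb = torch.randn(3, 8)
cost = torch.cdist(teacher_emb, student_emb)            # (5, 3) semantic cost
plan = sinkhorn(cost, torch.full((5,), 1 / 5), torch.full((3,), 1 / 3))
teacher_probs = torch.softmax(torch.randn(5), dim=0)    # teacher output distribution
# Row-normalize the plan into a teacher->student label mapping, then transport.
mapped = (plan / plan.sum(dim=1, keepdim=True)).t() @ teacher_probs
```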
- Representation Consolidation for Training Expert Students [54.90754502493968]
We show that a multi-head, multi-task distillation method is sufficient to consolidate representations from task-specific teacher(s) and improve downstream performance.
Our method can also combine the representational knowledge of multiple teachers trained on one or multiple domains into a single model.
arXiv Detail & Related papers (2021-07-16T17:58:18Z)
- Graph Consistency based Mean-Teaching for Unsupervised Domain Adaptive Person Re-Identification [54.58165777717885]
This paper proposes a Graph Consistency based Mean-Teaching (GCMT) method that constructs a Graph Consistency Constraint (GCC) between teacher and student networks.
Experiments on three datasets, i.e., Market-1501, DukeMTMC-reID, and MSMT17, show that the proposed GCMT outperforms state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2021-05-11T04:09:49Z)
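A graph-consistency constraint of the kind named in the GCMT entry above can be sketched as an agreement loss between the batch-level similarity graphs of a mean-teacher and its student. The cosine-similarity graph, the MSE penalty, and the EMA update below are common mean-teaching ingredients used here as assumptions, not the paper's exact formulation.

```python
# Rough sketch of a graph-consistency constraint between a mean-teacher and a
# student: build a pairwise similarity graph over a batch from each network's
# features and penalize their disagreement, while the teacher tracks an EMA of
# the student (assumed ingredients; not the paper's formulation).
import torch
import torch.nn.functional as F

def similarity_graph(features):
    """Row-normalized cosine-similarity graph over a batch of embeddings."""
    f = F.normalize(features, dim=-1)
    return F.softmax(f @ f.t(), dim=-1)

def graph_consistency_loss(student_feats, teacher_feats):
    """Penalize disagreement between the student's and teacher's batch graphs."""
    g_s = similarity_graph(student_feats)
    g_t = similarity_graph(teacher_feats).detach()
    return F.mse_loss(g_s, g_t)

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Mean-teacher update: teacher weights track an EMA of the student's."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1 - momentum)

# Toy usage: identical-architecture teacher/student encoders on a random batch.
student_net = torch.nn.Linear(32, 8)
teacher_net = torch.nn.Linear(32, 8)
x = torch.randn(16, 32)
loss = graph_consistency_loss(student_net(x), teacher_net(x))
ema_update(teacher_net, student_net)
```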