Related papers: Distillation of Large Language Models via Concrete Score Matching

URL: http://arxiv.org/abs/2509.25837v1
Date: Tue, 30 Sep 2025 06:21:28 GMT
Title: Distillation of Large Language Models via Concrete Score Matching
Authors: Yeongmin Kim, Donghyeok Shin, Mina Kang, Byeonghu Na, Il-Chul Moon
Abstract summary: Large language models (LLMs) deliver remarkable performance but are costly to deploy, motivating knowledge distillation (KD) for efficient inference. We propose a discrete score-matching objective that overcomes both softmax-induced smoothing and restrictions on the optimal solution set. Experiments show that CSD consistently surpasses recent KD objectives, achieves favorable fidelity-diversity trade-offs, and yields complementary gains when combined with on-policy techniques.
Score: 28.320219993420434
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) deliver remarkable performance but are costly to deploy, motivating knowledge distillation (KD) for efficient inference. Existing KD objectives typically match student and teacher probabilities via softmax, which blurs valuable logit information. While direct logit distillation (DLD) mitigates softmax smoothing, it fails to account for logit shift invariance, thereby restricting the solution space. We propose Concrete Score Distillation (CSD), a discrete score-matching objective that overcomes both softmax-induced smoothing and restrictions on the optimal solution set. We resolve the training instability and quadratic complexity of discrete score-matching in autoregressive LLMs, and the resulting CSD objective aligns relative logit differences across all vocabulary pairs between student and teacher with flexible weighting. We provide both mode-seeking and mode-covering instances within our framework and evaluate CSD on task-agnostic instruction-following and task-specific distillation using GPT-2-1.5B, OpenLLaMA-7B, and GEMMA-7B-IT. Experiments show that CSD consistently surpasses recent KD objectives, achieves favorable fidelity-diversity trade-offs, and yields complementary gains when combined with on-policy techniques, demonstrating its scalability and effectiveness for LLM distillation.
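The abstract's idea of aligning relative logit differences across all vocabulary pairs can be illustrated with a small sketch. This is not the paper's exact CSD objective: the function names are hypothetical, and the product-form weights w_i * w_j with sum(w) = 1 are an assumption chosen for illustration. Under that assumption, the pairwise sum over all (i, j) equals twice the weighted variance of the per-token logit gap, so it can be computed in linear rather than quadratic time, and adding a constant to every student logit leaves the loss unchanged.

```python
def pairwise_logit_diff_loss(student, teacher, weights=None):
    """Naive O(V^2) weighted squared mismatch of pairwise logit differences.

    Illustrative only; assumes product-form pair weights w_i * w_j.
    """
    v = len(student)
    if weights is None:
        weights = [1.0 / v] * v  # uniform weights (assumption)
    total = 0.0
    for i in range(v):
        for j in range(v):
            ds = student[i] - student[j]  # student logit difference
            dt = teacher[i] - teacher[j]  # teacher logit difference
            total += weights[i] * weights[j] * (ds - dt) ** 2
    return total


def pairwise_logit_diff_loss_linear(student, teacher, weights=None):
    """Same value in O(V), valid when the weights sum to 1.

    Uses sum_ij w_i w_j (d_i - d_j)^2 = 2 * Var_w(d), with d = student - teacher.
    """
    v = len(student)
    if weights is None:
        weights = [1.0 / v] * v
    gap = [s - t for s, t in zip(student, teacher)]
    mean = sum(w * d for w, d in zip(weights, gap))
    var = sum(w * (d - mean) ** 2 for w, d in zip(weights, gap))
    return 2.0 * var


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0, 0.0]
    student = [1.5, 0.7, -0.2, 0.1]
    shifted = [x + 10.0 for x in student]  # constant shift of all logits
    print(pairwise_logit_diff_loss(student, teacher))
    print(pairwise_logit_diff_loss_linear(student, teacher))
    print(pairwise_logit_diff_loss_linear(shifted, teacher))
```

The three printed values should agree up to floating-point error, which shows both the linear-time reduction and the shift invariance that a plain squared error on raw logits would lack.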

Related papers

Rethinking Selective Knowledge Distillation [21.167064592056196]
It remains unclear which importance signals, selection policies, and their interplay are most effective. We introduce student-entropy-guided position selection (SE-KD) across the class and sample axes. This approach yields complementary efficiency gains that make offline teacher caching feasible.
arXiv Detail & Related papers (2026-02-01T18:58:27Z)
LLM-Oriented Token-Adaptive Knowledge Distillation [64.08412563818662]
We propose a novel framework that adapts the distillation process to the real-time learning state of each token. AdaKD consists of two synergistic modules driven by a unified token difficulty metric. As a plug-and-play framework, AdaKD can consistently improve the performance of various distillation methods on multiple model architectures and benchmarks.
arXiv Detail & Related papers (2025-10-13T16:55:07Z)
Less is More: Selective Reflection for Compatible and Efficient Knowledge Distillation in Large Language Models [0.0]
Knowledge Distillation (KD) is a technique for compressing large language models (LLMs) into compact, efficient student models. We propose Selective Reflection Distillation (SRD), a novel data curation framework. As a plug-and-play enhancement, SRD consistently improves distillation outcomes across diverse white-box KD approaches.
arXiv Detail & Related papers (2025-08-08T08:55:53Z)
Being Strong Progressively! Enhancing Knowledge Distillation of Large Language Models through a Curriculum Learning Framework [0.0]
Knowledge Distillation (KD) compresses large language models (LLMs) by transferring the teacher model's capabilities to a smaller student model. Existing KD methods for LLMs often fail to prevent significant shifts in the student model's distribution during training. We propose a novel, plug-in curriculum learning framework inspired by the strength training principle of "progressive overload".
arXiv Detail & Related papers (2025-06-06T02:48:38Z)
Structured Agent Distillation for Large Language Model [56.38279355868093]
We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models. Our method segments trajectories into [REASON] and [ACT] spans, applying segment-specific losses to align each component with the teacher's behavior. Experiments on ALFWorld, HotPotQA-ReAct, and WebShop show that our approach consistently outperforms token-level and imitation learning baselines.
arXiv Detail & Related papers (2025-05-20T02:01:55Z)
ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via $α$-$β$-Divergence [89.630486749083]
Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student model. The core challenge in KD lies in balancing two mode-concentration effects. We propose ABKD, a generic framework with $\alpha$-$\beta$-divergence.
arXiv Detail & Related papers (2025-05-07T16:48:49Z)
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD). RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals. Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z)
Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs). We re-formulate KD of LLMs into two stages: first optimizing an objective consisting of implicit reward and reverse KL divergence. We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
arXiv Detail & Related papers (2024-06-28T09:23:40Z)
Self-Distillation from the Last Mini-Batch for Consistency Regularization [14.388479145440636]
We propose an efficient and reliable self-distillation framework, named Self-Distillation from Last Mini-Batch (DLB). Our proposed mechanism guides the training stability and consistency, resulting in robustness to label noise. Experimental results on three classification benchmarks illustrate that our approach can consistently outperform state-of-the-art self-distillation approaches.
arXiv Detail & Related papers (2022-03-30T09:50:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.