LLM-Oriented Token-Adaptive Knowledge Distillation
- URL: http://arxiv.org/abs/2510.11615v1
- Date: Mon, 13 Oct 2025 16:55:07 GMT
- Title: LLM-Oriented Token-Adaptive Knowledge Distillation
- Authors: Xurong Xie, Zhucun Xue, Jiafu Wu, Jian Li, Yabiao Wang, Xiaobin Hu, Yong Liu, Jiangning Zhang
- Abstract summary: We propose a novel framework that adapts the distillation process to the real-time learning state of each token. AdaKD consists of two synergistic modules driven by a unified token difficulty metric. As a plug-and-play framework, AdaKD can consistently improve the performance of various distillation methods on multiple model architectures and benchmarks.
- Score: 64.08412563818662
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation (KD) is a key technique for compressing large-scale language models (LLMs), yet prevailing logit-based methods typically employ static strategies that are misaligned with the dynamic learning process of student models. These methods treat all tokens indiscriminately and apply a single, fixed temperature, resulting in suboptimal knowledge transfer. To address these limitations, we propose LLM-Oriented Token-Adaptive Knowledge Distillation (AdaKD), a novel framework that adapts the distillation process to the real-time learning state of each token. AdaKD consists of two synergistic modules driven by a unified token difficulty metric. First, our Loss-Driven Adaptive Token Focusing (LATF) module dynamically adjusts the distillation focus by monitoring the student's learning stability, concentrating computational resources on the most valuable tokens at each training phase. Second, we introduce Inverse Difficulty Temperature Scaling (IDTS), a counterintuitive yet effective token-level temperature strategy: it uses low temperatures for difficult tokens to enable targeted error correction, and high temperatures for easy tokens to encourage the student to learn from the teacher's complete, smooth output distribution, thereby enhancing generalization. As a plug-and-play framework, AdaKD consistently improves the performance of various distillation methods on multiple model architectures and benchmarks.
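To make the two modules concrete, below is a minimal PyTorch sketch of one way a token-adaptive distillation loss in this spirit could be written. It is an illustrative assumption, not the authors' implementation: using the per-token student cross-entropy as the difficulty metric, the hard median-based easy/hard split, and all names and hyperparameters (`adaptive_token_kd_loss`, `focus_ratio`, `t_easy`, `t_hard`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def adaptive_token_kd_loss(student_logits, teacher_logits, target_ids,
                           focus_ratio=0.5, t_easy=4.0, t_hard=1.0):
    """Hypothetical token-adaptive KD loss (illustrative sketch, not the paper's code).

    student_logits, teacher_logits: [batch, seq_len, vocab]
    target_ids:                     [batch, seq_len] ground-truth next tokens
    focus_ratio: fraction of hardest tokens to distill (LATF-style focusing)
    t_easy / t_hard: temperatures for easy / hard tokens (IDTS-style scaling)
    """
    B, S, V = student_logits.shape

    # 1) Token difficulty: per-token cross-entropy of the student
    #    (one plausible stand-in for the paper's unified difficulty metric).
    with torch.no_grad():
        token_ce = F.cross_entropy(
            student_logits.reshape(-1, V), target_ids.reshape(-1),
            reduction="none").reshape(B, S)

    # 2) LATF-style focusing: keep only the hardest `focus_ratio` of tokens per sequence.
    k = max(1, int(focus_ratio * S))
    cutoff = token_ce.topk(k, dim=-1).values[..., -1, None]   # k-th largest CE per sequence
    focus_mask = (token_ce >= cutoff).float()                  # [B, S]

    # 3) IDTS-style temperatures: low T for hard tokens (sharp, corrective signal),
    #    high T for easy tokens (smooth, full teacher distribution).
    median_ce = token_ce.median(dim=-1, keepdim=True).values
    temps = torch.where(token_ce > median_ce,
                        torch.full_like(token_ce, t_hard),
                        torch.full_like(token_ce, t_easy)).unsqueeze(-1)  # [B, S, 1]

    # 4) Temperature-scaled KL divergence per token, averaged over the focused tokens.
    log_p_student = F.log_softmax(student_logits / temps, dim=-1)
    p_teacher = F.softmax(teacher_logits / temps, dim=-1)
    kl = (p_teacher * (p_teacher.clamp_min(1e-9).log() - log_p_student)).sum(-1)  # [B, S]
    kl = kl * temps.squeeze(-1) ** 2            # standard T^2 rescaling for KD gradients

    return (kl * focus_mask).sum() / focus_mask.sum().clamp_min(1.0)
```

The median split and the two fixed temperature values are deliberate simplifications: the paper presumably maps its difficulty metric to token-level temperatures more smoothly and ties the focusing schedule to the student's learning stability, neither of which is reconstructed here.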
Related papers
- LayerCake: Token-Aware Contrastive Decoding within Large Language Model Layers [53.43862310647276]
Large language models (LLMs) excel at natural language understanding and generation but remain vulnerable to factual errors. We introduce a token-aware, layer-localized contrastive decoding method that aligns specific token types with their most influential transformer layers to improve factual generation. Our method requires no additional training or model modification, and experiments demonstrate that it consistently improves factuality across multiple LLMs and various benchmarks.
arXiv Detail & Related papers (2025-07-06T14:35:43Z)
- Being Strong Progressively! Enhancing Knowledge Distillation of Large Language Models through a Curriculum Learning Framework [0.0]
Knowledge Distillation (KD) compresses large language models (LLMs) by transferring the teacher model's capabilities to a smaller student model. Existing KD methods for LLMs often fail to prevent significant shifts in the student model's distribution during training. We propose a novel, plug-in curriculum learning framework inspired by the strength training principle of "progressive overload".
arXiv Detail & Related papers (2025-06-06T02:48:38Z)
- PATS: Process-Level Adaptive Thinking Mode Switching [53.53401063490537]
Current large language models (LLMs) typically adopt a fixed reasoning strategy, either simple or complex, for all questions, regardless of their difficulty. This neglect of variation in task and reasoning process complexity leads to an imbalance between performance and efficiency. Existing methods attempt to implement training-free fast-slow thinking system switching to handle problems of varying difficulty, but are limited by coarse-grained solution-level strategy adjustments. We propose a novel reasoning paradigm: Process-Level Adaptive Thinking Mode Switching (PATS), which enables LLMs to dynamically adjust their reasoning strategy based on the difficulty of each step, optimizing the balance between performance and efficiency.
arXiv Detail & Related papers (2025-05-25T17:58:50Z)
- Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations to distill from task-relevant representations only. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z)
- Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models [81.74999702045339]
Multi-Level Optimal Transport (MultiLevelOT) is a novel approach that advances optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both token and sequence levels. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness.
arXiv Detail & Related papers (2024-12-19T04:51:06Z)
- Instance Temperature Knowledge Distillation [15.095465128404161]
Existing methods dynamically adjust the temperature to enable the student network to adapt to varying learning difficulties. We formulate the adjustment of temperature as a sequential decision-making task and propose a method based on reinforcement learning. Our framework can serve as a plug-and-play technique to be inserted into various KD methods easily.
arXiv Detail & Related papers (2024-06-27T14:00:05Z)
- Dynamic Temperature Knowledge Distillation [9.6046915661065]
Temperature plays a pivotal role in moderating label softness in the realm of knowledge distillation (KD).
Traditional approaches often employ a static temperature throughout the KD process.
We propose Dynamic Temperature Knowledge Distillation (DTKD), which introduces a dynamic, cooperative temperature control for both teacher and student models simultaneously.
arXiv Detail & Related papers (2024-04-19T08:40:52Z)
- Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training [58.20089993899729]
This paper proposes TempBalance, a straightforward yet effective layerwise learning rate method.
We show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization.
We also show that TempBalance outperforms a number of state-of-the-art metrics and schedulers.
arXiv Detail & Related papers (2023-12-01T05:38:17Z)
- Curriculum Temperature for Knowledge Distillation [30.94721463833605]
We propose a curriculum-based technique, termed Curriculum Temperature for Knowledge Distillation (CTKD).
CTKD controls the task difficulty level during the student's learning career through a dynamic and learnable temperature.
As an easy-to-use plug-in technique, CTKD can be seamlessly integrated into existing knowledge distillation frameworks.
arXiv Detail & Related papers (2022-11-29T14:10:35Z)
- Self-Knowledge Distillation with Progressive Refinement of Targets [1.1470070927586016]
We propose a simple yet effective regularization method named progressive self-knowledge distillation (PS-KD).
PS-KD progressively distills a model's own knowledge to soften hard targets during training.
We show that PS-KD provides an effect of hard example mining by rescaling gradients according to difficulty in classifying examples.
arXiv Detail & Related papers (2020-06-22T04:06:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.