Dual-Space Knowledge Distillation for Large Language Models
- URL: http://arxiv.org/abs/2406.17328v3
- Date: Tue, 01 Oct 2024 16:45:12 GMT
- Title: Dual-Space Knowledge Distillation for Large Language Models
- Authors: Songming Zhang, Xue Zhang, Zengkui Sun, Yufeng Chen, Jinan Xu
- Abstract summary: We propose a dual-space knowledge distillation (DSKD) framework that unifies the output spaces of the two models for KD.
Our framework is not only compatible with various distance functions for KD like the current framework, but also supports KD between any two LLMs regardless of their vocabularies.
- Abstract: Knowledge distillation (KD) is known as a promising solution to compress large language models (LLMs) via transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the two models so that more knowledge can be transferred. However, in the current white-box KD framework, the output distributions are from the respective output spaces of the two models, using their own prediction heads. We argue that the space discrepancy will lead to low similarity between the teacher model and the student model on both representation and distribution levels. Furthermore, this discrepancy also hinders the KD process between models with different vocabularies, which is common for current LLMs. To address these issues, we propose a dual-space knowledge distillation (DSKD) framework that unifies the output spaces of the two models for KD. On the basis of DSKD, we further develop a cross-model attention mechanism, which can automatically align the representations of the two models with different vocabularies. Thus, our framework is not only compatible with various distance functions for KD (e.g., KL divergence) like the current framework, but also supports KD between any two LLMs regardless of their vocabularies. Experiments on task-agnostic instruction-following benchmarks show that DSKD significantly outperforms the current white-box KD framework with various distance functions, and also surpasses existing KD methods for LLMs with different vocabularies.
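As a rough illustration of the two setups contrasted in the abstract, the following sketch (PyTorch-style, not the authors' released code) shows the standard white-box KD loss, where each model's own prediction head produces its distribution, and a hypothetical dual-space variant in which student hidden states are first projected into the teacher's representation space so that both distributions come from one shared head; the `projector` module and temperature handling here are illustrative assumptions.

```python
# Minimal sketch, assuming PyTorch; not the DSKD authors' implementation.
import torch
import torch.nn.functional as F

def whitebox_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard white-box KD: KL(teacher || student) over a shared vocabulary,
    with each distribution produced by the corresponding model's own head."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

def dual_space_kd_loss(student_hidden, projector, teacher_head, teacher_logits,
                       temperature=2.0):
    """Dual-space-style variant: project student hidden states into the
    teacher's representation space (via an assumed learned `projector`) and
    reuse the teacher's prediction head, so both output distributions live in
    the same output space before the divergence is computed."""
    projected = projector(student_hidden)              # [batch, seq, d_teacher]
    unified_student_logits = teacher_head(projected)   # teacher-vocabulary logits
    return whitebox_kd_loss(unified_student_logits, teacher_logits, temperature)
```

The paper's cross-model attention mechanism, which aligns token representations when the two models use different vocabularies, would sit on top of this basic structure and is not sketched here.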
Related papers
- Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs).
We re-formulate KD of LLMs into two stages, the first of which optimizes an objective consisting of an implicit reward and reverse KL divergence.
We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
arXiv Detail & Related papers (2024-06-28T09:23:40Z)
- Sinkhorn Distance Minimization for Knowledge Distillation [97.64216712016571]
Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs).
In this paper, we show that the commonly used KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation.
We propose the Sinkhorn Knowledge Distillation (SinKD) that exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions.
arXiv Detail & Related papers (2024-02-27T01:13:58Z)
- DistiLLM: Towards Streamlined Distillation for Large Language Models [53.46759297929675]
DistiLLM is a more effective and efficient KD framework for auto-regressive language models.
DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, where we unveil and leverage its theoretical properties, and (2) an adaptive off-policy approach designed to enhance the efficiency of utilizing student-generated outputs.
arXiv Detail & Related papers (2024-02-06T11:10:35Z)
- MiniLLM: Knowledge Distillation of Large Language Models [112.93051247165089]
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs).
We propose a KD approach that distills LLMs into smaller language models.
Our method is scalable across different model families with 120M to 13B parameters.
arXiv Detail & Related papers (2023-06-14T14:44:03Z)
- KDExplainer: A Task-oriented Attention Model for Explaining Knowledge Distillation [59.061835562314066]
We introduce a novel task-oriented attention model, termed KDExplainer, to shed light on the working mechanism underlying vanilla KD.
We also introduce a portable tool, dubbed the virtual attention module (VAM), that can be seamlessly integrated with various deep neural networks (DNNs) to enhance their performance under KD.
arXiv Detail & Related papers (2021-05-10T08:15:26Z)
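Several entries above hinge on the choice of divergence: the reverse KL in DPKD, the KL/RKL/JS comparison and Sinkhorn distance in SinKD, and the skew KL in DistiLLM. The sketch below gives generic PyTorch-style reference implementations of these quantities; the exact formulations, normalizations, and cost matrices used in each paper may differ, so treat this as an assumption-laden illustration rather than any paper's official loss.

```python
# Generic reference sketch (assumed formulations, not taken from the papers).
import torch
import torch.nn.functional as F

def kl(p_logits, q_logits):
    """KL(p || q) between categorical distributions given as logits."""
    p = F.softmax(p_logits, dim=-1)
    log_p = F.log_softmax(p_logits, dim=-1)
    log_q = F.log_softmax(q_logits, dim=-1)
    return (p * (log_p - log_q)).sum(-1).mean()

def forward_kl(teacher_logits, student_logits):
    # KL(teacher || student): associated with mode-averaging behaviour.
    return kl(teacher_logits, student_logits)

def reverse_kl(teacher_logits, student_logits):
    # KL(student || teacher): associated with mode-collapsing behaviour.
    return kl(student_logits, teacher_logits)

def skew_kl(teacher_logits, student_logits, alpha=0.1):
    """Skewed KL in the spirit of DistiLLM, KL(p || alpha*p + (1-alpha)*q);
    the alpha value and exact definition here are assumptions."""
    p = F.softmax(teacher_logits, dim=-1)
    q = F.softmax(student_logits, dim=-1)
    mix = alpha * p + (1.0 - alpha) * q
    return (p * (p.clamp_min(1e-9).log() - mix.clamp_min(1e-9).log())).sum(-1).mean()

def sinkhorn_distance(p, q, cost, eps=0.1, n_iters=50):
    """Entropy-regularized optimal-transport (Sinkhorn) distance between two
    categorical distributions p, q (shape [V]) under a cost matrix (shape [V, V])."""
    K = torch.exp(-cost / eps)                    # Gibbs kernel
    u, v = torch.ones_like(p), torch.ones_like(q)
    for _ in range(n_iters):                      # Sinkhorn-Knopp iterations
        u = p / (K @ v + 1e-9)
        v = q / (K.t() @ u + 1e-9)
    transport_plan = u.unsqueeze(1) * K * v.unsqueeze(0)
    return (transport_plan * cost).sum()
```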
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.