Fugu-MT Paper Translation (Abstract): A Dual-Space Framework for General Knowledge Distillation of Large Language Models

Paper overview: A Dual-Space Framework for General Knowledge Distillation of Large Language Models

arxiv url: http://arxiv.org/abs/2504.11426v1
Date: Tue, 15 Apr 2025 17:38:47 GMT
Status: Translation complete
Last updated in system: 2025-04-23 23:08:55.838511
Title: A Dual-Space Framework for General Knowledge Distillation of Large Language Models
Authors: Xue Zhang, Songming Zhang, Yunlong Liang, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
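Abstract summary: Knowledge distillation (KD) is a promising solution for compressing large language models (LLMs) by transferring their knowledge to smaller models. The current white-box KD framework has two limitations. The authors propose a dual-space knowledge distillation (DSKD) framework that unifies the prediction heads of the teacher and student models for KD.
Reference score (independently computed measure of attention): 98.73585104789217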
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Knowledge distillation (KD) is a promising solution to compress large language models (LLMs) by transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the teacher model and the student model to transfer more information. However, we reveal that the current white-box KD framework exhibits two limitations: a) bridging probability distributions from different output spaces will limit the similarity between the teacher model and the student model; b) this framework cannot be applied to LLMs with different vocabularies. One of the root causes for these limitations is that the distributions from the teacher and the student for KD are output by different prediction heads, which yield distributions in different output spaces and dimensions. Therefore, in this paper, we propose a dual-space knowledge distillation (DSKD) framework that unifies the prediction heads of the teacher and the student models for KD. Specifically, we first introduce two projectors with ideal initialization to project the teacher/student hidden states into the student/teacher representation spaces. After this, the hidden states from different models can share the same head and unify the output spaces of the distributions. Furthermore, we develop an exact token alignment (ETA) algorithm to align the same tokens in two differently-tokenized sequences. Based on the above, our DSKD framework is a general KD framework that supports both off-policy and on-policy KD, and KD between any two LLMs regardless of their vocabularies. Extensive experiments on instruction-following, mathematical reasoning, and code generation benchmarks show that DSKD significantly outperforms existing methods based on the current white-box KD framework and surpasses other cross-tokenizer KD methods for LLMs with different vocabularies.
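The core idea in the abstract, projecting one model's hidden state into the other model's representation space so that both distributions come from the same prediction head, can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the authors' implementation: all sizes, weights, and names (`student_to_teacher`, `teacher_head`, and so on) are hypothetical, the projector is a plain linear map with arbitrary values rather than the "ideal initialization" the paper mentions, and only the student-to-teacher direction is shown.

```python
import math


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical toy sizes: student hidden dim 2, teacher hidden dim 3,
# teacher vocabulary size 4. All values below are made up for illustration.
student_hidden = [0.5, -1.0]
teacher_hidden = [1.0, 0.2, -0.3]

# Hypothetical linear projector from the student space (2) to the teacher space (3).
student_to_teacher = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]

# Hypothetical teacher prediction head: teacher hidden (3) -> teacher vocab (4).
teacher_head = [
    [0.2, -0.1, 0.4],
    [0.0, 0.3, -0.2],
    [-0.5, 0.1, 0.1],
    [0.3, 0.3, 0.3],
]

# Project the student state, then score BOTH states with the same head,
# so the two distributions live in the same output space.
projected_student = matvec(student_to_teacher, student_hidden)
teacher_probs = softmax(matvec(teacher_head, teacher_hidden))
student_probs = softmax(matvec(teacher_head, projected_student))

# A distance between the two distributions serves as the distillation signal.
kd_loss = kl_divergence(teacher_probs, student_probs)
print(f"KD loss in the teacher's output space: {kd_loss:.4f}")
```

Because both distributions are produced by `teacher_head`, they share a vocabulary and dimension even if the student's own head uses a different vocabulary. The symmetric direction (teacher state projected into the student space and scored by the student head) would follow the same pattern.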
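The abstract also mentions aligning identical tokens across two differently tokenized sequences. One plausible way to do this, shown here only as a sketch and not as the paper's ETA algorithm, is to compare character spans: two tokens are treated as the same when they cover exactly the same span of the underlying text. The function names and the example tokenizations are hypothetical, and the sketch assumes non-empty tokens that concatenate to the same string.

```python
def char_spans(tokens):
    """Return the (start, end) character span of each token in the joined text."""
    spans = []
    start = 0
    for token in tokens:
        end = start + len(token)
        spans.append((start, end))
        start = end
    return spans


def align_exact_tokens(tokens_a, tokens_b):
    """Pair up token indices (i, j) whose tokens cover the same character span.

    Assumes both token lists are non-empty strings that spell the same text.
    """
    if "".join(tokens_a) != "".join(tokens_b):
        raise ValueError("both tokenizations must spell the same text")

    span_to_index_b = {span: j for j, span in enumerate(char_spans(tokens_b))}
    pairs = []
    for i, span in enumerate(char_spans(tokens_a)):
        j = span_to_index_b.get(span)
        if j is not None:
            pairs.append((i, j))
    return pairs


# Two hypothetical tokenizations of the same sentence.
tokens_teacher = ["KD", " helps", " small", " models"]
tokens_student = ["KD", " help", "s", " small", " mod", "els"]

aligned = align_exact_tokens(tokens_teacher, tokens_student)
print(aligned)  # [(0, 0), (2, 3)]
```

Only "KD" and " small" are split identically by both tokenizations, so only those positions are paired; positions where the segmentations differ are left unaligned.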

Related papers