Fugu-MT paper translation (abstract): Dual-Space Knowledge Distillation for Large Language Models

Paper overview: Dual-Space Knowledge Distillation for Large Language Models

arxiv url: http://arxiv.org/abs/2406.17328v3
Date: Tue, 01 Oct 2024 16:45:12 GMT
Status: translation complete
Last updated in system: 2024-12-02 07:46:08.620252
Title: Dual-Space Knowledge Distillation for Large Language Models
Authors: Songming Zhang, Xue Zhang, Zengkui Sun, Yufeng Chen, Jinan Xu
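Abstract summary: We propose a dual-space knowledge distillation (DSKD) framework that unifies the output spaces of the two models for KD. Our framework is not only compatible with various distance functions for KD, like the current framework, but also supports KD between any two LLMs regardless of their vocabularies.
Reference score (attention metric computed by this site): 39.798007795604676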
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Knowledge distillation (KD) is known as a promising solution to compress large language models (LLMs) via transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the two models so that more knowledge can be transferred. However, in the current white-box KD framework, the output distributions are from the respective output spaces of the two models, using their own prediction heads. We argue that the space discrepancy will lead to low similarity between the teacher model and the student model on both representation and distribution levels. Furthermore, this discrepancy also hinders the KD process between models with different vocabularies, which is common for current LLMs. To address these issues, we propose a dual-space knowledge distillation (DSKD) framework that unifies the output spaces of the two models for KD. On the basis of DSKD, we further develop a cross-model attention mechanism, which can automatically align the representations of the two models with different vocabularies. Thus, our framework is not only compatible with various distance functions for KD (e.g., KL divergence) like the current framework, but also supports KD between any two LLMs regardless of their vocabularies. Experiments on task-agnostic instruction-following benchmarks show that DSKD significantly outperforms the current white-box KD framework with various distance functions, and also surpasses existing KD methods for LLMs with different vocabularies.
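Illustrative sketch (not the authors' implementation): the abstract contrasts the usual white-box KD loss, where each model's distribution comes from its own prediction head, with a loss computed in one shared output space. The Python below shows both under loud assumptions. The linear `projector`, the choice to reuse the teacher's head, and every function name are hypothetical; the abstract does not say how the spaces are unified, and the cross-model attention used for different vocabularies is not covered here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def white_box_kd_loss(teacher_logits, student_logits):
    """Usual white-box KD: distance between the two output distributions,
    each produced by that model's own prediction head."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

def shared_head_kd_loss(teacher_hidden, student_hidden, projector, teacher_head):
    """Assumed variant: map the student's hidden state into the teacher's
    representation space with a hypothetical linear projector, then score
    both hidden states with the same (teacher) prediction head."""
    projected = matvec(projector, student_hidden)
    teacher_logits = matvec(teacher_head, teacher_hidden)
    student_logits = matvec(teacher_head, projected)
    return white_box_kd_loss(teacher_logits, student_logits)

# Toy sizes: teacher hidden dim 3, student hidden dim 2, vocabulary of 4.
teacher_head = [[0.5, -0.2, 0.1],
                [0.0, 0.3, -0.4],
                [-0.1, 0.2, 0.6],
                [0.7, 0.0, -0.3]]
projector = [[1.0, 0.0],
             [0.0, 0.0],
             [0.0, 1.0]]
loss = shared_head_kd_loss([1.0, 0.5, -1.0], [0.8, -0.6], projector, teacher_head)
```

Because both distributions in `shared_head_kd_loss` come from the same head, the loss falls to zero exactly when the projected student representation yields the same logits as the teacher's. In a real setup the projector would be learned jointly with the student.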

Related papers