Fugu-MT paper translation (abstract): Understanding Knowledge Distillation in Post-Training: When It Helps and When It Fails

Paper abstract: Understanding Knowledge Distillation in Post-Training: When It Helps and When It Fails

arxiv url: http://arxiv.org/abs/2606.22942v1
Date: Mon, 22 Jun 2026 07:19:43 GMT
Status: Translation complete
Last updated in system: 2026-06-25 03:25:06.918904
Title: Understanding Knowledge Distillation in Post-Training: When It Helps and When It Fails
Authors: Xin Liu, Simin Ma, Shujian Liu, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Lu Wang, Kaiqiang Song,
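Abstract summary: Large language models (LLMs) achieve strong performance across many tasks, but their high computational cost limits deployment in resource-constrained environments. Knowledge distillation (KD) offers a practical solution by transferring knowledge from a larger teacher model to a smaller student model.
Reference score (independently computed attention score): 16.73080036450313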
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) achieve strong performance across many tasks, but their high computational cost limits deployment in resource-constrained environments. Knowledge Distillation (KD) offers a practical solution by transferring knowledge from a teacher model of a larger size to a smaller student model. While prior work has mainly examined task-specific or small-scale settings, the post-training stage for building general instruction-following models has received limited attention. In this paper, we conduct a systematic study of KD in post-training using the large-scale Tulu 3 dataset. We find that KD outperforms supervised fine-tuning (SFT) in low-data regimes, but its advantage diminishes as more training data is added. Distilling from a stronger instruction-tuned teacher restores substantial gains even with abundant data, indicating that KD remains effective when the teacher provides knowledge that the student cannot easily acquire from the training data alone. We further study domain-specific, low-resource scenarios and propose a two-stage KD strategy that leverages synthetic teacher-labeled data followed by refinement on human annotations. This method consistently improves student performance, providing practical guidance for building compact models in data-scarce environments.
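
The abstract compares KD with supervised fine-tuning (SFT) but does not state the training objective. As a rough illustration only, the sketch below uses the widely used formulation of distillation: a KL divergence between temperature-softened teacher and student distributions, optionally mixed with the usual cross-entropy on the reference token. The function names, the mixing weight `alpha`, and the `temperature` default are assumptions made for this example, not details taken from the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the conventional rescaling that keeps gradient
    magnitudes comparable across temperatures (an assumption here).
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


def sft_loss(student_logits, target_index):
    """Plain cross-entropy against the reference (human-labeled) token."""
    return -log_softmax(student_logits)[target_index]


def combined_loss(student_logits, teacher_logits, target_index,
                  alpha=0.5, temperature=2.0):
    """Hypothetical mix: alpha weights the KD term, (1 - alpha) the SFT term."""
    return (alpha * kd_loss(student_logits, teacher_logits, temperature)
            + (1.0 - alpha) * sft_loss(student_logits, target_index))


if __name__ == "__main__":
    student = [1.0, 0.5, -0.2]
    teacher = [2.0, 0.1, -1.0]
    print("KD term: ", kd_loss(student, teacher))
    print("SFT term:", sft_loss(student, target_index=0))
    print("Mixed:   ", combined_loss(student, teacher, target_index=0))
```

With `alpha=0` the mixed loss reduces to SFT, and with `alpha=1` it is pure distillation. The KD term vanishes when the student already matches the teacher, so it contributes signal only where the teacher's distribution differs from the student's.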

Related papers