Fugu-MT Paper Translation (Abstract): AutoDisc: Automatic Distillation Schedule for Large Language Model Compression

Paper overview: AutoDisc: Automatic Distillation Schedule for Large Language Model Compression

arxiv url: http://arxiv.org/abs/2205.14570v1
Date: Sun, 29 May 2022 04:22:48 GMT
Status: Translation complete
Last updated in system: 2022-05-31 14:17:23.834157
Title: AutoDisc: Automatic Distillation Schedule for Large Language Model Compression
Authors: Chen Zhang, Yang Yang, Qifan Wang, Jiahao Liu, Jingang Wang, Wei Wu, Dawei Song
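Abstract summary: We propose an Automatic Distillation Schedule (AutoDisc) for large language model compression. In particular, AutoDisc first specifies a set of teacher assistant candidates at different scales with gridding and pruning. AutoDisc is evaluated with an extensive set of experiments on the language understanding benchmark GLUE.
Reference score (attention score computed independently by this site): 20.705045295332237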
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Driven by the teacher-student paradigm, knowledge distillation is one of the de facto ways for language model compression. Recent studies have uncovered that conventional distillation is less effective when facing a large capacity gap between the teacher and the student, and introduced teacher assistant-based distillation to bridge the gap. As a connection, the scale and the performance of the teacher assistant are crucial for transferring the knowledge from the teacher to the student. However, existing teacher assistant-based methods manually select the scale of the teacher assistant, which fails to identify the teacher assistant with the optimal scale-performance tradeoff. To this end, we propose an Automatic Distillation Schedule (AutoDisc) for large language model compression. In particular, AutoDisc first specifies a set of teacher assistant candidates at different scales with gridding and pruning, and then optimizes all candidates in a once-for-all optimization with two approximations. The best teacher assistant scale is automatically selected according to the scale-performance tradeoff. AutoDisc is evaluated with an extensive set of experiments on the language understanding benchmark GLUE. Experimental results demonstrate the improved performance and applicability of our AutoDisc. We further apply AutoDisc to a language model with over one billion parameters and show the scalability of AutoDisc.
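As a rough illustration of the selection step described in the abstract (teacher assistant candidates laid out on a grid of scales, with the best one chosen by a scale-performance tradeoff), the sketch below shows one possible reading. The abstract does not define the tradeoff measure, so the weighted score, the `lam` parameter, and the names `Candidate`, `grid_candidates`, and `select_assistant` are all assumptions made for illustration and are not the authors' method. Candidate scores are taken as given; pruning and the once-for-all optimization are not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """A hypothetical teacher assistant candidate."""
    sparsity: float  # fraction of teacher parameters removed (0.0 = full teacher)
    score: float     # dev-set score of this candidate (assumed to be measured elsewhere)


def grid_candidates(step: float = 0.1, max_sparsity: float = 0.9) -> List[float]:
    """Return evenly spaced sparsity levels: step, 2*step, ..., max_sparsity."""
    n = round(max_sparsity / step)
    # Round each level to suppress floating-point noise such as 0.30000000000000004.
    return [round(i * step, 10) for i in range(1, n + 1)]


def select_assistant(cands: List[Candidate], teacher_score: float,
                     lam: float = 0.5) -> Candidate:
    """Pick the candidate with the highest assumed tradeoff score.

    Assumed score: lam * (retained performance) + (1 - lam) * sparsity.
    This weighting is a placeholder, not the measure used in the paper.
    Raises ValueError if cands is empty.
    """
    def tradeoff(c: Candidate) -> float:
        return lam * (c.score / teacher_score) + (1.0 - lam) * c.sparsity

    return max(cands, key=tradeoff)
```

Under this assumed score, a larger `lam` favors candidates that retain more of the teacher's performance, while a smaller `lam` favors smaller assistants.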

Related papers

Enhancing Reasoning Capabilities in SLMs with Reward Guided Dataset Distillation [0.0]
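We propose AdvDistill, a reward-guided dataset distillation framework. We use multiple generations (responses) from the teacher for each prompt and assign rewards based on rule-based verification. These varied, normally distributed rewards serve as weights when training the student model.
Paper reference translation (metadata) (2025-06-25T20:07:47Z)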
Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation [84.38105530043741]
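We propose Warmup-Distill, which aligns the distillation of the student with that of the teacher in advance of distillation. Experiments on seven benchmarks show that Warmup-Distill provides a warmed-up student that is better suited to distillation.
Paper reference translation (metadata) (2025-02-17T12:58:12Z)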
Distillation Scaling Laws [9.828322497230053]
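We propose a distillation scaling law that estimates the performance of a distilled model from the compute budget and its allocation between student and teacher. Our findings reduce the risks of large-scale distillation and show that student performance can be maximized by allocating compute to both the teacher and the student model.
Paper reference translation (metadata) (2025-02-12T17:52:47Z)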
MiniPLM: Knowledge Distillation for Pre-Training Language Models [109.83741809808483]
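MiniPLM is a framework for pre-training student language models (LMs) using a large teacher LM. For efficiency, MiniPLM performs offline teacher inference. For flexibility, MiniPLM operates only on the training corpus, enabling KD across model families.
Paper reference translation (metadata) (2024-10-22T17:40:32Z)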
Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty [0.0]
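We trained an ensemble consisting of a GPT-2 and small LLaMA models on the developmentally plausible, 10M-word BabyLM dataset. We then distilled it into a 58M-parameter LLaMA model.
Paper reference translation (metadata) (2023-08-03T20:20:01Z)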
Lifting the Curse of Capacity Gap in Distilling Language Models [19.370268407987652]
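We propose a mixture of minimal experts (MiniMoE), which imposes extra parameters on the student but introduces almost no additional inference compute. At a compression rate of ~50×, MiniMoE preserves ~95% of the teacher's GLUE score.
Paper reference translation (metadata) (2023-05-20T07:30:55Z)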
HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
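This paper focuses on task-agnostic distillation, which produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational cost and memory footprint. We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach based on iterative pruning.
Paper reference translation (metadata) (2023-02-19T17:37:24Z)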
ERNIE 3.0 Tiny: Frustratingly Simple Method to Improve Task-Agnostic Distillation Generalization [36.338614215561805]
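Task-agnostic knowledge distillation attempts to address the problem of deploying large pre-trained language models in resource-constrained scenarios. We show that leveraging multi-task learning in task-agnostic distillation can further improve the generalization of the resulting model.
Paper reference translation (metadata) (2023-01-09T15:12:50Z)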
Less is More: Task-aware Layer-wise Distillation for Language Model Compression [68.30497162547766]
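Layer-wise distillation is a powerful tool for compressing a large model (i.e., the teacher model) into a small one. We propose a novel Task-aware layEr-wise Distillation (TED). TED designs task-aware filters to align the hidden representations of the student and the teacher at each layer.
Paper reference translation (metadata) (2022-10-04T03:36:53Z)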
Knowledge Distillation via Weighted Ensemble of Teaching Assistants [18.593268785143426]
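Knowledge distillation is the process of transferring knowledge from a large model, called the teacher, to a smaller model, called the student. As the network size gap between teacher and student grows, the performance of the student network degrades. It has been shown that the student model (the smaller model) can be further improved by using multiple teaching assistant models.
Paper reference translation (metadata) (2022-06-23T22:50:05Z)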
Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
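We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation. DPK makes the performance of the teacher and student models positively correlated, so student accuracy can be further improved by applying larger teachers.
Paper reference translation (metadata) (2022-06-13T11:52:13Z)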
One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers [54.146208195806636]
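We propose MT-BERT, a multi-teacher knowledge distillation framework for pre-trained language model compression. We show that MT-BERT can train a high-quality student model from multiple teacher PLMs. Experiments on three benchmark datasets validate the effectiveness of MT-BERT in compressing PLMs.
Paper reference translation (metadata) (2021-06-02T08:42:33Z)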
Distilling Object Detectors with Task Adaptive Regularization [97.52935611385179]
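Current state-of-the-art object detectors come at the expense of high computational cost and are hard to deploy on low-end devices. Knowledge distillation, which aims to train a smaller student network by transferring knowledge from a larger teacher model, is one of the promising solutions for model miniaturization.
Paper reference translation (metadata) (2020-06-23T15:58:22Z)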

The related papers list is generated automatically from the titles and abstracts of papers on this site.
