Fugu-MT Paper Translation (Summary): Less is More: Task-aware Layer-wise Distillation for Language Model Compression

Paper overview: Less is More: Task-aware Layer-wise Distillation for Language Model Compression

arxiv url: http://arxiv.org/abs/2210.01351v3
Date: Mon, 5 Jun 2023 22:40:20 GMT
Status: Translation complete
Last updated in system: 2023-06-07 21:43:44.871530
Title: Less is More: Task-aware Layer-wise Distillation for Language Model Compression
Authors: Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, Tuo Zhao
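Abstract summary: Layer-wise distillation is a powerful tool for compressing large models (i.e., teacher models) into small ones. We propose a novel Task-aware layEr-wise Distillation (TED). TED designs task-aware filters to align the hidden representations of the student and the teacher at each layer.
Reference score (attention score computed independently by this site): 68.30497162547766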
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into small ones (i.e., student models). The student distills knowledge from the teacher by mimicking the hidden representations of the teacher at every intermediate layer. However, layer-wise distillation is difficult. Since the student has a smaller model capacity than the teacher, it is often under-fitted. Furthermore, the hidden representations of the teacher contain redundant information that the student does not necessarily need for the target task's learning. To address these challenges, we propose a novel Task-aware layEr-wise Distillation (TED). TED designs task-aware filters to align the hidden representations of the student and the teacher at each layer. The filters select the knowledge that is useful for the target task from the hidden representations. As such, TED reduces the knowledge gap between the two models and helps the student to fit better on the target task. We evaluate TED in two scenarios: continual pre-training and fine-tuning. TED demonstrates significant and consistent improvements over existing distillation methods in both scenarios. Code is available at https://github.com/cliang1453/task-aware-distillation.
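The abstract describes aligning student and teacher hidden representations through per-layer filters rather than comparing the raw representations directly. The following is a minimal sketch of that idea only, not the authors' implementation (see the repository linked in the abstract for that). It assumes fixed linear filters, a mean-squared-error distance, and an unweighted average over layers; the function names `apply_filter`, `mse`, and `layerwise_filtered_loss` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def apply_filter(weights: Matrix, hidden: Sequence[float]) -> Vector:
    """Project a hidden vector through a linear filter (one row per output unit).

    Assumption: a plain linear map stands in for the task-aware filter.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("filtered representations must have the same size")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def layerwise_filtered_loss(
    student_hidden: Sequence[Sequence[float]],
    teacher_hidden: Sequence[Sequence[float]],
    student_filters: Sequence[Matrix],
    teacher_filters: Sequence[Matrix],
) -> float:
    """Average, over layers, of the MSE between filtered student and teacher outputs."""
    n_layers = len(student_hidden)
    if n_layers == 0 or not (
        len(teacher_hidden) == len(student_filters) == len(teacher_filters) == n_layers
    ):
        raise ValueError("expected the same non-zero number of layers everywhere")
    per_layer = [
        mse(apply_filter(fs, hs), apply_filter(ft, ht))
        for hs, ht, fs, ft in zip(
            student_hidden, teacher_hidden, student_filters, teacher_filters
        )
    ]
    return sum(per_layer) / n_layers


# Toy data: the teacher has a wider hidden state (3) than the student (2).
# The teacher filter keeps the first two coordinates and drops the third,
# which plays the role of information the student does not need.
student_hidden = [[1.0, 0.0], [0.5, 0.5]]
teacher_hidden = [[1.0, 0.0, 9.0], [0.5, 0.5, -4.0]]
identity_2 = [[1.0, 0.0], [0.0, 1.0]]
keep_first_two = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
student_filters = [identity_2, identity_2]
teacher_filters = [keep_first_two, keep_first_two]

loss = layerwise_filtered_loss(
    student_hidden, teacher_hidden, student_filters, teacher_filters
)
print(loss)  # 0.0
```

In this toy setup the raw representations cannot even be compared because their sizes differ, while the filtered ones match exactly. In practice the filters would be learned on the target task rather than written by hand.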

Related papers

Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
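Introduces a self-distillation (SSD) training strategy that filters and weights teacher representations so that distillation draws only on task-relevant representations. Experimental results on real-world affective computing tasks, including wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets, show that the proposed SSD method outperforms state-of-the-art methods.
Paper reference translation (metadata) (2025-04-19T14:08:56Z)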
Triplet Knowledge Distillation [73.39109022280878]
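In knowledge distillation, the teacher is typically much larger than the student, which makes the teacher's solution hard for the student to learn. To ease this mimicking difficulty, we introduce a triplet knowledge distillation mechanism named TriKD.
Paper reference translation (metadata) (2023-05-25T12:12:31Z)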
ERNIE 3.0 Tiny: Frustratingly Simple Method to Improve Task-Agnostic Distillation Generalization [36.338614215561805]
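Task-agnostic knowledge distillation attempts to address the problem of deploying large pre-trained language models in resource-constrained scenarios. We show that leveraging multi-task learning in task-agnostic distillation can improve the generalization of the resulting model.
Paper reference translation (metadata) (2023-01-09T15:12:50Z)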
Representation Consolidation for Training Expert Students [54.90754502493968]
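We show that a multi-head, multi-task distillation method is sufficient to consolidate the representations of task-specific teachers and improve downstream performance. The method can also combine the representational knowledge of multiple teachers trained on multiple domains into a single model.
Paper reference translation (metadata) (2021-07-16T17:58:18Z)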
Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
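We propose knowledge-consistent distillation, a novel student-dependent distillation method that makes the teacher's knowledge more consistent with the student. The method is very flexible and can easily be combined with other state-of-the-art approaches.
Paper reference translation (metadata) (2021-03-31T06:52:20Z)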
ALP-KD: Attention-Based Layer Projection for Knowledge Distillation [30.896957367331137]
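Two neural networks, a teacher and a student, are coupled together during training. The teacher network is a trustworthy predictor, and the student tries to mimic its predictions. In this setting, distillation usually happens only on the final predictions, but the student can also benefit from the teacher's supervision of its internal components.
Paper reference translation (metadata) (2020-12-27T22:30:13Z)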
Progressive Network Grafting for Few-Shot Knowledge Distillation [60.38608462158474]
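We propose a two-stage distillation scheme suited to few-shot data. In the first step, the student blocks are grafted one by one onto the teacher, and the parameters of each grafted block are learned together with those of the other teacher blocks. On CIFAR10, CIFAR100, and ILSVRC-2012, we demonstrate that satisfactory results can be obtained with only a few samples.
Paper reference translation (metadata) (2020-12-09T08:34:36Z)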
Reducing the Teacher-Student Gap via Spherical Knowledge Disitllation [67.75526580926149]
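Knowledge distillation aims to obtain a compact and effective model by learning the mapping function from a much larger one. This work investigates the confidence gap between teacher and student and studies the capacity gap problem. Confidence is not necessary for knowledge distillation, and forcing the student to learn it may hurt the student's performance.
Paper reference translation (metadata) (2020-10-15T03:03:36Z)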
Channel Distillation: Channel-Wise Attention for Knowledge Distillation [3.6269274596116476]
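We propose a new distillation method comprising two distillation strategies and a loss decay strategy. First, Channel Distillation (CD) transfers channel-wise information from the teacher to the student. Second, Guided Knowledge Distillation (GKD) lets the student mimic only the teacher's correct outputs.
Paper reference translation (metadata) (2020-06-02T14:59:50Z)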

The related papers list is generated automatically from the titles and abstracts of papers on this site.

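This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from use of this site (including all information and translations).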