Fugu-MT paper translation (abstract): Weight subcloning: direct initialization of transformers using larger pretrained ones

Paper overview: Weight subcloning: direct initialization of transformers using larger pretrained ones

arxiv url: http://arxiv.org/abs/2312.09299v1
Date: Thu, 14 Dec 2023 19:08:56 GMT
Status: Translation complete
Last updated in system: 2023-12-18 17:57:50.028873
Title: Weight subcloning: direct initialization of transformers using larger pretrained ones
Authors: Mohammad Samragh, Mehrdad Farajtabar, Sachin Mehta, Raviteja Vemulapalli, Fartash Faghri, Devang Naik, Oncel Tuzel, Mohammad Rastegari
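Abstract summary: This paper proposes a technique for transferring the knowledge of a pretrained model to smaller variants. Weight subcloning speeds up the training of scaled-down transformers by initializing their weights from a larger pretrained model. The authors report 4x faster training for vision transformers in image classification and for language models designed for next-token prediction.
Reference score (attention score computed independently by the site): 42.056148990349094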
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training large transformer models from scratch for a target task requires lots of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with weights of a pretrained model of the same size and specification to increase the convergence and training speed. However, what if no pretrained model of the required size is available? In this paper, we introduce a simple yet effective technique to transfer the knowledge of a pretrained model to smaller variants. Our approach called weight subcloning expedites the training of scaled-down transformers by initializing their weights from larger pretrained models. Weight subcloning involves an operation on the pretrained model to obtain the equivalent initialized scaled-down model. It consists of two key steps: first, we introduce neuron importance ranking to decrease the embedding dimension per layer in the pretrained model. Then, we remove blocks from the transformer model to match the number of layers in the scaled-down network. The result is a network ready to undergo training, which gains significant improvements in training speed compared to random initialization. For instance, we achieve 4x faster training for vision transformers in image classification and language models designed for next token prediction.
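The two steps named in the abstract (shrinking the embedding dimension by neuron importance ranking, then removing blocks) can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the toy block layout (`w_in`, `w_out`), the importance score (mean absolute weight per embedding index, summed over blocks), the evenly spaced choice of which blocks to keep, and the function names `embedding_importance` and `subclone`. The abstract does not specify any of these details.

```python
import numpy as np


def embedding_importance(blocks):
    """Score each embedding index by mean |weight|, summed over all blocks.

    ASSUMPTION: this magnitude-based score is a stand-in; the abstract only
    says that neurons are ranked by importance.
    """
    d = blocks[0]["w_in"].shape[0]
    scores = np.zeros(d)
    for blk in blocks:
        scores += np.abs(blk["w_in"]).mean(axis=1)   # rows of w_in read from the embedding
        scores += np.abs(blk["w_out"]).mean(axis=0)  # columns of w_out write to the embedding
    return scores


def subclone(blocks, new_dim, new_depth):
    """Return a smaller stack of blocks initialized from the larger one."""
    # Step 1: keep the same top-scoring embedding indices in every block,
    # so that the shared (residual) dimension stays aligned across layers.
    scores = embedding_importance(blocks)
    keep_dims = np.sort(np.argsort(scores)[-new_dim:])

    # Step 2: drop blocks to match the target depth.
    # ASSUMPTION: evenly spaced blocks are kept; the abstract does not say which.
    keep_blocks = np.linspace(0, len(blocks) - 1, new_depth).round().astype(int)

    return [
        {
            "w_in": blocks[i]["w_in"][keep_dims, :].copy(),
            "w_out": blocks[i]["w_out"][:, keep_dims].copy(),
        }
        for i in keep_blocks
    ]


# Toy "pretrained" model: 6 blocks, embedding dimension 8, hidden dimension 16.
rng = np.random.default_rng(0)
parent = [
    {"w_in": rng.normal(size=(8, 16)), "w_out": rng.normal(size=(16, 8))}
    for _ in range(6)
]

# Scaled-down initialization: 3 blocks, embedding dimension 4.
child = subclone(parent, new_dim=4, new_depth=3)
```

The sketch leaves the hidden dimension untouched and only shrinks the embedding axis; a real transformer would also need its attention projections, normalization parameters, and embedding tables sliced with the same indices before training starts.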

Related papers