Fugu-MT Paper Translation (Abstract): 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed

Paper abstract: 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed

arxiv url: http://arxiv.org/abs/2102.02888v1
Date: Thu, 4 Feb 2021 21:02:19 GMT
Status: Translation complete
System update date: 2021-02-08 23:35:38.293214
Title: 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed
Title (reference translation): 1-bit Adam: Communication-Efficient Large-Scale Training with Adam's Convergence Speed
Authors: Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He
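Abstract summary: Communication has become a major bottleneck on commodity systems with standard TCP interconnects that offer limited network bandwidth. One of the most effective methods is error-compensated compression, which offers robust convergence speed even under 1-bit compression. We propose 1-bit Adam, which reduces the communication volume by up to 5x, improves scalability, and provides the same convergence speed as uncompressed Adam.
Reference score (independently computed attention score): 39.23129626683372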
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on commodity systems with standard TCP interconnects that offer limited network bandwidth. Communication compression is an important technique to reduce training time on such systems. One of the most effective methods is error-compensated compression, which offers robust convergence speed even under 1-bit compression. However, state-of-the-art error compensation techniques only work with basic optimizers like SGD and momentum SGD, which are linearly dependent on the gradients. They do not work with non-linear gradient-based optimizers like Adam, which offer state-of-the-art convergence efficiency and accuracy for models like BERT. In this paper, we propose 1-bit Adam that reduces the communication volume by up to $5\times$, offers much better scalability, and provides the same convergence speed as uncompressed Adam. Our key finding is that Adam's variance (non-linear term) becomes stable (after a warmup phase) and can be used as a fixed precondition for the rest of the training (compression phase). Experiments on up to 256 GPUs show that 1-bit Adam enables up to $3.3\times$ higher throughput for BERT-Large pre-training and up to $2.9\times$ higher throughput for SQuAD fine-tuning. In addition, we provide theoretical analysis for our proposed work.
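Abstract (reference translation): Scalable training of large models (such as BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system perspective, communication has become a major bottleneck, particularly on commodity systems with standard TCP interconnects that offer limited network bandwidth. Communication compression is an important technique for shortening training time on such systems. One of the most effective methods is error-compensated compression, which provides robust convergence speed even under 1-bit compression. However, state-of-the-art error compensation techniques work only with basic optimizers such as SGD and momentum SGD, which depend linearly on the gradients. They do not work with non-linear gradient-based optimizers such as Adam, which offer state-of-the-art convergence efficiency and accuracy for models like BERT. This paper proposes 1-bit Adam, which reduces the communication volume by up to $5\times$, improves scalability, and provides the same convergence speed as uncompressed Adam. The key finding is that Adam's variance (the non-linear term) becomes stable after a warmup phase and can be used as a fixed precondition for the rest of training (the compression phase). In experiments on up to 256 GPUs, 1-bit Adam achieves up to $3.3\times$ higher throughput for BERT-Large pre-training and up to $2.9\times$ higher throughput for SQuAD fine-tuning. A theoretical analysis of the proposed method is also provided.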
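
The two-phase idea described in the abstract (a warmup phase with ordinary Adam, then a compression phase in which the variance term is frozen and used as a fixed precondition) can be illustrated with a small single-process sketch. This is not the authors' implementation: the function names, the choice of mean magnitude as the 1-bit scale, the omission of Adam's bias correction, and the step of overwriting the momentum with its compressed value are all assumptions made for illustration, and no actual communication between workers is modeled.

```python
import numpy as np


def one_bit_compress(x):
    """Keep only the sign of each entry, rescaled by the mean magnitude.

    Assumption: one shared scale per tensor; other scalings are possible.
    """
    scale = np.mean(np.abs(x))
    return scale * np.sign(x)


def compress_with_error_feedback(m, err):
    """Compress (m + err) and return (compressed, new residual error).

    The residual is carried to the next step so that compression error
    is compensated over time instead of being discarded.
    """
    corrected = m + err
    compressed = one_bit_compress(corrected)
    return compressed, corrected - compressed


def one_bit_adam_sketch(grad_fn, x0, steps=300, warmup_steps=10,
                        lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Hypothetical single-process sketch of the warmup/compression phases."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)    # momentum (first moment)
    v = np.zeros_like(x)    # variance (second moment), frozen after warmup
    err = np.zeros_like(x)  # error-feedback residual
    for t in range(steps):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g
        if t < warmup_steps:
            # Warmup phase: ordinary Adam variance update (no bias correction here).
            v = beta2 * v + (1.0 - beta2) * g * g
        else:
            # Compression phase: v stays fixed; only the momentum is compressed.
            # Assumption: the momentum is replaced by its compressed value.
            m, err = compress_with_error_feedback(m, err)
        x = x - lr * m / (np.sqrt(v) + eps)
    return x


# Usage on a toy quadratic, f(x) = 0.5 * ||x - target||^2.
target = np.array([1.0, -1.0, 1.0, -1.0])
x_final = one_bit_adam_sketch(lambda x: x - target, np.zeros(4))
print(np.linalg.norm(x_final - target))
```

Because the update in the compression phase is linear in the momentum once the variance is fixed, the same error-feedback scheme used for momentum SGD applies. In a distributed setting the compressed tensor would presumably be what is exchanged between workers; that part is not shown here. Note that `warmup_steps` must be greater than zero in this sketch, otherwise the frozen variance is zero and the step size is governed only by `eps`.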
