Fugu-MT Paper Translation (Abstract): Scaling Vision Transformers to 22 Billion Parameters

Paper overview: Scaling Vision Transformers to 22 Billion Parameters

arXiv URL: http://arxiv.org/abs/2302.05442v1
Date: Fri, 10 Feb 2023 18:58:21 GMT
Status: Translation complete
System update date: 2023-02-13 14:58:58.674744
Title: Scaling Vision Transformers to 22 Billion Parameters
Authors: Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, Neil Houlsby
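Abstract summary: Vision Transformers (ViT) have brought the same architecture as language models to image and video modelling, but have not yet been scaled to nearly the same degree. This paper presents a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B). ViT-22B demonstrates the potential for "LLM-like" scaling in vision and provides key steps towards getting there.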
Reference score (independently computed attention measure): 140.67853929168382
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.


Related papers