Fugu-MT Paper Translation (Abstract): A ConvNet for the 2020s

Paper abstract: A ConvNet for the 2020s

arxiv url: http://arxiv.org/abs/2201.03545v1
Date: Mon, 10 Jan 2022 18:59:10 GMT
Status: Translation complete
Last updated in system: 2022-01-11 17:24:10.553508
Title: A ConvNet for the 2020s
Title (reference translation): A ConvNet for the 2020s
Authors: Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell and Saining Xie
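Abstract summary: Vision Transformers (ViTs) quickly superseded ConvNets as the state-of-the-art image classification model. It was hierarchical Transformers, which reintroduced several ConvNet priors, that made Transformers practically viable as a generic vision backbone. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve.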
Reference score (independently computed attention score): 94.89735578018099
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.
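Abstract (reference translation): The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which replaced ConvNets as the state-of-the-art image classification model. A vanilla ViT, however, runs into difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. Hierarchical Transformers (e.g., Swin Transformers) reintroduced several ConvNet priors, making Transformers practical as a generic vision backbone and delivering remarkable performance across a wide variety of vision tasks. Yet the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers rather than to the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer and, along the way, identify several key components that contribute to the performance difference. The result of this exploration is a family of pure ConvNet models called ConvNeXt. Built entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while retaining the simplicity and efficiency of standard ConvNets.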

Related papers