Fugu-MT Paper Translation (Abstract): TurboViT: Generating Fast Vision Transformers via Generative Architecture Search

Paper abstract: TurboViT: Generating Fast Vision Transformers via Generative Architecture Search

arxiv url: http://arxiv.org/abs/2308.11421v1
Date: Tue, 22 Aug 2023 13:08:29 GMT
Status: Translation complete
Last updated in system: 2023-08-23 17:57:02.324640
Title: TurboViT: Generating Fast Vision Transformers via Generative Architecture Search
Title (reference translation): TurboViT: Generating Fast Vision Transformers via Generative Architecture Search
Authors: Alexander Wong, Saad Abbasi, Saeejith Nair
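Abstract summary: In recent years, vision transformers have shown unprecedented levels of performance on a variety of visual perception tasks. Research on the design of efficient vision transformers has also been active recently. This study explores the design of fast vision transformer architectures via generative architecture search.
Reference score (independently computed attention score): 74.24393546346974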
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision transformers have shown unprecedented levels of performance in tackling various visual perception tasks in recent years. However, the architectural and computational complexity of such network architectures has made them challenging to deploy in real-world applications with high-throughput, low-memory requirements. As such, there has been significant research recently on the design of efficient vision transformer architectures. In this study, we explore the generation of fast vision transformer architecture designs via generative architecture search (GAS) to achieve a strong balance between accuracy and architectural and computational efficiency. Through this generative architecture search process, we create TurboViT, a highly efficient hierarchical vision transformer architecture design that is generated around mask unit attention and Q-pooling design patterns. The resulting TurboViT architecture design achieves significantly lower architectural complexity (>2.47$\times$ smaller than FasterViT-0 while achieving the same accuracy) and computational complexity (>3.4$\times$ fewer FLOPs and 0.9% higher accuracy than MobileViT2-2.0) when compared to 10 other state-of-the-art efficient vision transformer network architecture designs within a similar range of accuracy on the ImageNet-1K dataset. Furthermore, TurboViT demonstrated strong inference latency and throughput in both low-latency and batch processing scenarios (>3.21$\times$ lower latency and >3.18$\times$ higher throughput compared to FasterViT-0 in the low-latency scenario). These promising results demonstrate the efficacy of leveraging generative architecture search for generating efficient transformer architecture designs for high-throughput scenarios.
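Abstract (reference translation): In recent years, vision transformers have shown unprecedented performance on a variety of visual perception tasks. However, the architectural and computational complexity of such network architectures makes them difficult to deploy in real-world applications with high-throughput, low-memory requirements. Accordingly, research on the design of efficient vision transformer architectures has been active in recent years. This study explores the design of fast vision transformer architectures using generative architecture search (GAS), aiming for a strong balance between accuracy and architectural and computational efficiency. Through this generative architecture search process, we create TurboViT, a highly efficient hierarchical vision transformer architecture generated around mask unit attention and Q-pooling design patterns. Compared with 10 other state-of-the-art efficient vision transformer network architectures of similar accuracy on the ImageNet-1K dataset, the resulting TurboViT architecture design achieves significantly lower architectural complexity (>2.47$\times$ smaller than FasterViT-0 at the same accuracy) and computational complexity (>3.4$\times$ fewer FLOPs and 0.9% higher accuracy than MobileViT2-2.0). In addition, TurboViT showed strong inference latency and throughput in both low-latency and batch processing scenarios (>3.21$\times$ lower latency and >3.18$\times$ higher throughput than FasterViT-0 in the low-latency scenario). These promising results demonstrate the effectiveness of using generative architecture search to generate efficient transformer architecture designs for high-throughput scenarios.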

Related papers