Fugu-MT Paper Translation (Abstract): DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices

Paper overview: DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices

arxiv url: http://arxiv.org/abs/2309.05015v1
Date: Sun, 10 Sep 2023 12:26:17 GMT
Status: Translation complete
Last updated in system: 2023-09-12 15:09:16.171926
Title: DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices
Title (reference translation): DeViT: Decomposing Vision Transformers for Collaborative Inference on Edge Devices
Authors: Guanyu Xu, Zhiwei Hao, Yong Luo, Han Hu, Jianping An, Shiwen Mao
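Abstract summary: Vision transformers (ViTs) have achieved state-of-the-art performance on multiple computer vision benchmarks. However, ViT models suffer from a vast number of parameters and high computation cost, which makes them difficult to deploy on resource-constrained edge devices. This paper proposes a collaborative inference framework, termed DeViT, that decomposes a large ViT to facilitate edge deployment.
Reference score (independently computed attention score): 42.89175608336226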
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent years have witnessed the great success of the vision transformer (ViT), which has achieved state-of-the-art performance on multiple computer vision benchmarks. However, ViT models suffer from vast numbers of parameters and high computation cost, which makes them difficult to deploy on resource-constrained edge devices. Existing solutions mostly compress ViT models into a compact model but still cannot achieve real-time inference. To tackle this issue, we propose to explore the divisibility of the transformer structure and decompose the large ViT into multiple small models for collaborative inference on edge devices. Our objective is to achieve fast and energy-efficient collaborative inference while maintaining accuracy comparable to that of large ViTs. To this end, we first propose a collaborative inference framework termed DeViT to facilitate edge deployment by decomposing large ViTs. Subsequently, we design a decomposition-and-ensemble algorithm based on knowledge distillation, termed DEKD, to fuse multiple small decomposed models while dramatically reducing communication overheads, and we handle heterogeneous models by developing a feature matching module that helps the decomposed models imitate the large ViT. Extensive experiments with three representative ViT backbones on four widely used datasets demonstrate that our method achieves efficient collaborative inference for ViTs and outperforms existing lightweight ViTs, striking a good trade-off between efficiency and accuracy. For example, our DeViTs improve end-to-end latency by 2.89$\times$ with only a 1.65% accuracy sacrifice on CIFAR-100 compared to the large ViT, ViT-L/16, on the GPU server. Our DeDeiTs surpass the recent efficient ViT, MobileViT-S, by 3.54% in accuracy on ImageNet-1K, while running 1.72$\times$ faster and requiring 55.28% lower energy consumption on the edge device.
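Abstract (reference translation): In recent years, the vision transformer (ViT) has been highly successful, achieving state-of-the-art performance on multiple computer vision benchmarks. ViT models, however, carry a vast number of parameters and a high computation cost, so deploying them on resource-constrained edge devices is difficult. Existing solutions mainly compress a ViT model into a compact one, yet still fall short of real-time inference. This work therefore explores the divisibility of the transformer structure and proposes decomposing a large ViT into multiple small models that perform collaborative inference on edge devices. The goal is fast, energy-efficient collaborative inference with accuracy comparable to that of large ViTs. The authors first propose DeViT, a collaborative inference framework that eases edge deployment by decomposing large ViTs. They then design a decomposition-and-ensemble algorithm based on knowledge distillation that fuses the multiple small decomposed models while dramatically reducing communication overhead, and they develop a feature matching module that handles heterogeneous models by helping the decomposed models imitate the large ViT. Extensive experiments with three representative ViT backbones on four widely used datasets show that the method achieves efficient collaborative inference for ViTs and outperforms existing lightweight ViTs, striking a good trade-off between efficiency and accuracy. For example, compared with the large ViT, ViT-L/16, on the GPU server, DeViTs improve end-to-end latency by 2.89$\times$ while sacrificing only 1.65% accuracy on CIFAR-100. DeDeiTs surpass the recent efficient ViT, MobileViT-S, by 3.54% in accuracy on ImageNet-1K, while running 1.72$\times$ faster and consuming 55.28% less energy on the edge device.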
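
The two ideas the abstract combines, distilling a large model into small ones and fusing the small models' outputs at inference time, can be illustrated with a minimal sketch. This is not the paper's DEKD algorithm or its feature matching module: the loss below is the generic soft-target distillation loss, the fusion is plain logit averaging, and every name (`softmax`, `distillation_loss`, `ensemble_logits`, `predict`) and the temperature value are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,  # hypothetical value, not taken from the paper
) -> float:
    """Generic soft-target loss: T^2 * KL(teacher || student).

    Assumption: this is the textbook distillation loss, used here only to
    show how a small model can be trained to imitate a large one.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


def ensemble_logits(per_device_logits: Sequence[Sequence[float]]) -> List[float]:
    """Fuse the outputs of several small models by averaging their logits.

    Assumption: simple averaging stands in for the paper's fusion step.
    In this sketch each device would only transmit one short logit vector.
    """
    count = len(per_device_logits)
    return [sum(column) / count for column in zip(*per_device_logits)]


def predict(per_device_logits: Sequence[Sequence[float]]) -> int:
    """Return the class index with the highest fused logit."""
    fused = ensemble_logits(per_device_logits)
    return max(range(len(fused)), key=fused.__getitem__)


if __name__ == "__main__":
    # Three hypothetical edge devices, each producing logits for 3 classes.
    device_outputs = [[0.1, 0.9, 0.0], [0.2, 0.3, 0.5], [0.0, 0.8, 0.2]]
    print("predicted class:", predict(device_outputs))
    print("loss vs. teacher:", distillation_loss([0.1, 0.9, 0.0], [0.0, 2.0, 0.5]))
```

In this sketch the loss is zero when a small model reproduces the large model's logits exactly and grows as the two softened distributions diverge, which is the sense in which a decomposed model "imitates" the large ViT.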

Related papers