DeViT: Decomposing Vision Transformers for Collaborative Inference in
Edge Devices
- URL: http://arxiv.org/abs/2309.05015v1
- Date: Sun, 10 Sep 2023 12:26:17 GMT
- Title: DeViT: Decomposing Vision Transformers for Collaborative Inference in
Edge Devices
- Authors: Guanyu Xu, Zhiwei Hao, Yong Luo, Han Hu, Jianping An, Shiwen Mao
- Abstract summary: Vision transformer (ViT) has achieved state-of-the-art performance on multiple computer vision benchmarks.
ViT models suffer from huge numbers of parameters and high computation cost, making deployment on resource-constrained edge devices difficult.
We propose a collaborative inference framework termed DeViT to facilitate edge deployment by decomposing large ViTs.
- Score: 42.89175608336226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed the great success of the vision
transformer (ViT), which has achieved state-of-the-art performance on multiple
computer vision benchmarks. However, ViT models suffer from huge numbers of
parameters and high computation cost, making deployment on resource-constrained
edge devices difficult. Existing solutions mostly compress ViT models into
compact ones but still cannot achieve real-time inference. To tackle this
issue, we propose to explore the divisibility of the transformer structure and
decompose the large ViT
into multiple small models for collaborative inference at edge devices. Our
objective is to achieve fast and energy-efficient collaborative inference while
maintaining accuracy comparable to that of large ViTs. To this end, we first
propose a collaborative inference framework termed DeViT to facilitate edge
deployment by decomposing large ViTs. Subsequently, we design a
decomposition-and-ensemble algorithm based on knowledge distillation, termed
DEKD, to fuse multiple small decomposed models while dramatically reducing
communication overheads, and handle heterogeneous models by developing a
feature matching module that helps the decomposed models imitate the large ViT.
Extensive experiments with three representative ViT backbones on four widely
used datasets demonstrate that our method achieves efficient collaborative
inference for ViTs and outperforms existing lightweight ViTs, striking a good
trade-off between efficiency and accuracy. For example, our DeViTs improves
end-to-end latency by 2.89$\times$ with only a 1.65% accuracy drop on CIFAR-100
compared to the large ViT-L/16 on a GPU server. DeDeiTs
surpasses the recent efficient ViT, MobileViT-S, by 3.54% in accuracy on
ImageNet-1K, while running 1.72$\times$ faster and requiring 55.28% lower
energy consumption on the edge device.
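
Below is a minimal, hypothetical sketch (in PyTorch; not the authors' released
code) of the two ideas described in the abstract: training each small
decomposed model with a knowledge-distillation loss plus a feature matching
term against the large teacher ViT, and ensembling only the logits of the
decomposed models at inference time so that little data has to be communicated
between edge devices. The names FeatureMatcher, dekd_style_loss, T, alpha, and
beta are illustrative assumptions, not the actual DEKD implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class FeatureMatcher(nn.Module):
        """Projects a small model's feature into the teacher's dimension so
        heterogeneous models can be compared with an L2 loss (assumed design)."""

        def __init__(self, student_dim: int, teacher_dim: int):
            super().__init__()
            self.proj = nn.Linear(student_dim, teacher_dim)

        def forward(self, student_feat, teacher_feat):
            return F.mse_loss(self.proj(student_feat), teacher_feat)


    def dekd_style_loss(student_logits, teacher_logits, student_feat,
                        teacher_feat, matcher, labels,
                        T=4.0, alpha=0.5, beta=0.1):
        """Cross-entropy + soft-target distillation + feature matching; the
        temperature and weighting factors are placeholders, not paper values."""
        ce = F.cross_entropy(student_logits, labels)
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                      F.softmax(teacher_logits / T, dim=-1),
                      reduction="batchmean") * (T * T)
        fm = matcher(student_feat, teacher_feat)
        return ce + alpha * kd + beta * fm


    @torch.no_grad()
    def collaborative_inference(images, decomposed_models):
        """Each decomposed model (conceptually on its own edge device) produces
        logits; only these compact logits are exchanged and averaged."""
        logits = [model(images) for model in decomposed_models]
        return torch.stack(logits, dim=0).mean(dim=0)

In a real deployment each entry of decomposed_models would run on a different
edge device, so only per-device logit vectors rather than full feature maps
would need to be transmitted, which is what keeps communication overhead small.
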
Related papers
- GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous
Structured Pruning for Vision Transformer [76.2625311630021]
Vision transformers (ViTs) have shown very impressive empirical performance in various computer vision tasks.
To mitigate their large model size and high computational cost, structured pruning is a promising solution that compresses the model and enables practical efficiency.
We propose GOHSP, a unified framework of Graph and Optimization-based Structured Pruning for ViT models.
arXiv Detail & Related papers (2023-01-13T00:40:24Z) - HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision
Transformers [35.92244135055901]
HeatViT is an image-adaptive token pruning framework for vision transformers (ViTs) on embedded FPGAs.
HeatViT can achieve 0.7%$\sim$8.9% higher accuracy compared to existing ViT pruning studies.
HeatViT can achieve more than 28.4% computation reduction for various widely used ViTs.
arXiv Detail & Related papers (2022-11-15T13:00:43Z) - Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution guided distillation scheme for fully quantized vision transformers (Q-ViT)
Our method achieves much better performance than prior methods.
arXiv Detail & Related papers (2022-10-13T04:00:29Z) - Super Vision Transformer [131.4777773281238]
Experimental results on ImageNet demonstrate that our SuperViT can considerably reduce the computational cost of ViT models while even increasing performance.
Our SuperViT significantly outperforms existing studies on efficient vision transformers.
arXiv Detail & Related papers (2022-05-23T15:42:12Z) - Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z) - Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z) - A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks.
We propose a unified framework for structural pruning of both ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information on this site and is not responsible for any consequences of its use.