Vis-TOP: Visual Transformer Overlay Processor
- URL: http://arxiv.org/abs/2110.10957v1
- Date: Thu, 21 Oct 2021 08:11:12 GMT
- Title: Vis-TOP: Visual Transformer Overlay Processor
- Authors: Wei Hu, Dian Xu, Zimeng Fan, Fang Liu, Yanxiang He
- Abstract summary: Transformer has achieved good results in Natural Language Processing (NLP) and has also started to expand into Computer Vision (CV).
We propose Vis-TOP, an overlay processor for various visual Transformer models.
Vis-TOP summarizes the characteristics of all visual Transformer models and implements a three-layer and two-level transformation structure.
- Score: 9.80151619872144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, Transformer has achieved good results in Natural Language
Processing (NLP) and has also started to expand into Computer Vision (CV).
Excellent models such as the Vision Transformer and Swin Transformer have
emerged. At the same time, the deployment platform for Transformer models has been
extended to embedded devices to serve resource-sensitive application scenarios.
However, due to the large number of parameters, the complex computational flow,
and the many structural variants of Transformer models, a number of issues must be
addressed in their hardware design. This is both
an opportunity and a challenge. We propose Vis-TOP (Visual Transformer Overlay
Processor), an overlay processor for various visual Transformer models. It
differs from coarse-grained overlay processors such as CPU, GPU, NPE, and from
fine-grained customized designs for a specific model. Vis-TOP summarizes the
characteristics of all visual Transformer models and implements a three-layer
and two-level transformation structure that allows the model to be switched or
changed freely without changing the hardware architecture. At the same time,
the corresponding instruction bundle and hardware architecture are designed around
this three-layer and two-level transformation structure. After quantizing the Swin
Transformer tiny model to 8-bit fixed point (fix_8), we implemented the overlay
processor on the ZCU102. Compared to a GPU, Vis-TOP's throughput is 1.5x higher.
Compared to existing Transformer accelerators, our throughput per DSP is between
2.2x and 11.7x higher. In short, the approach in
this paper meets the requirements of real-time AI in terms of both resource
consumption and inference speed. Vis-TOP provides a cost-effective and
power-effective solution based on reconfigurable devices for computer vision at
the edge.
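The abstract reports quantizing the Swin Transformer tiny model to 8-bit fixed point (fix_8) before mapping it onto the ZCU102. As a rough sketch of what such a step can look like, the snippet below quantizes a float tensor to signed 8-bit fixed point in NumPy; the fractional bit width, rounding mode, and saturation behavior are illustrative assumptions, not details taken from the Vis-TOP paper.

```python
import numpy as np

def quantize_fix8(x: np.ndarray, frac_bits: int = 4) -> np.ndarray:
    """Map a float tensor to signed 8-bit fixed point with `frac_bits` fractional bits.

    Illustrative sketch only: frac_bits, rounding, and saturation are assumptions,
    not the fix_8 scheme used in the Vis-TOP paper.
    """
    scale = 1 << frac_bits                          # step size = 1 / 2**frac_bits
    q = np.round(x * scale)                         # snap to the fixed-point grid
    return np.clip(q, -128, 127).astype(np.int8)    # saturate to the int8 range

def dequantize_fix8(q: np.ndarray, frac_bits: int = 4) -> np.ndarray:
    """Recover an approximate float tensor from its fixed-point representation."""
    return q.astype(np.float32) / (1 << frac_bits)

if __name__ == "__main__":
    w = np.random.randn(4, 4).astype(np.float32) * 0.5   # toy weight tile
    q = quantize_fix8(w)
    err = np.max(np.abs(w - dequantize_fix8(q)))
    print("int8 tile:\n", q)
    print("max reconstruction error:", err)
```

Lowering `frac_bits` widens the representable range at the cost of precision; accelerator designs typically calibrate this trade-off per layer, though the abstract does not specify how Vis-TOP chooses it.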
Related papers
- Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
- ViTA: A Vision Transformer Inference Accelerator for Edge Applications [4.3469216446051995]
Vision Transformer models, such as ViT, Swin Transformer, and Transformer-in-Transformer, have recently gained significant traction in computer vision tasks.
They are compute-heavy and difficult to deploy in resource-constrained edge devices.
We propose ViTA - a hardware accelerator for inference of vision transformer models, targeting resource-constrained edge computing devices.
arXiv Detail & Related papers (2023-02-17T19:35:36Z)
- Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that the additional computational burden of recomputing activations is more than overcome for deeper models.
arXiv Detail & Related papers (2023-02-09T18:59:54Z)
- Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed as LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z)
- Hierarchical Transformers Are More Efficient Language Models [19.061388006885686]
Transformer models yield impressive results on many NLP and sequence modeling tasks.
Remarkably, Transformers can handle long sequences, which allows them to produce long coherent outputs.
We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences.
arXiv Detail & Related papers (2021-10-26T14:00:49Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from the 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.