Light-Weight Vision Transformer with Parallel Local and Global
Self-Attention
- URL: http://arxiv.org/abs/2307.09120v1
- Date: Tue, 18 Jul 2023 10:07:06 GMT
- Title: Light-Weight Vision Transformer with Parallel Local and Global
Self-Attention
- Authors: Nikolas Ebert, Laurenz Reichardt, Didier Stricker, Oliver
Wasenmüller
- Abstract summary: We redesign Vision Transformer PLG-ViT to be more compact and efficient.
We achieve 79.5% top-1 accuracy on the ImageNet-1K classification benchmark.
Our networks demonstrate strong performance on general vision benchmarks.
- Score: 11.255962936937744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While transformer architectures have dominated computer vision in recent
years, these models cannot easily be deployed on hardware with limited
resources for autonomous driving tasks that require real-time performance.
Their computational complexity and memory requirements limit their use,
especially for applications with high-resolution inputs. In our work, we
redesign the powerful state-of-the-art Vision Transformer PLG-ViT into a much
more compact and efficient architecture that is suitable for such tasks. We
identify computationally expensive blocks in the original PLG-ViT architecture
and propose several redesigns aimed at reducing the number of parameters and
floating-point operations. As a result of our redesign, we are able to reduce
PLG-ViT in size by a factor of 5, with a moderate drop in performance. We
propose two variants, optimized for the best trade-offs between parameter count
and runtime as well as parameter count and accuracy. With only 5 million
parameters, we achieve 79.5% top-1 accuracy on the ImageNet-1K
classification benchmark. Our networks demonstrate strong performance on general
vision benchmarks like COCO instance segmentation. In addition, we conduct a
series of experiments, demonstrating the potential of our approach in solving
various tasks specifically tailored to the challenges of autonomous driving and
transportation.
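The abstract names the core mechanism, parallel local and global self-attention, without spelling out the block design. Below is a minimal, illustrative PyTorch sketch of one way such a block can be built, with attention running in parallel over non-overlapping local windows and over a pooled global token grid; the class name, window size, pooling ratio, and concatenation-based fusion are assumptions for illustration, not the actual PLG-ViT block.

```python
# Minimal sketch: parallel local (windowed) and global (pooled) self-attention.
# All hyperparameters and the fusion step are illustrative assumptions.
import torch
import torch.nn as nn


class ParallelLocalGlobalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, window: int = 7, pool: int = 4):
        super().__init__()
        self.window = window
        # Local branch: self-attention restricted to non-overlapping windows.
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Global branch: self-attention over a spatially pooled (coarse) grid.
        self.pool = nn.AvgPool2d(kernel_size=pool, stride=pool)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)  # concatenate branches, project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); H and W are assumed divisible by the window size.
        B, C, H, W = x.shape
        w = self.window

        # Local branch: partition into w x w windows and attend within each window.
        loc = x.reshape(B, C, H // w, w, W // w, w)
        loc = loc.permute(0, 2, 4, 3, 5, 1).reshape(B * (H // w) * (W // w), w * w, C)
        loc, _ = self.local_attn(loc, loc, loc)
        loc = loc.reshape(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        loc = loc.reshape(B, C, H, W)

        # Global branch: attend over the pooled grid, then upsample back.
        g = self.pool(x)                                    # (B, C, H/p, W/p)
        gh, gw = g.shape[2], g.shape[3]
        g = g.flatten(2).transpose(1, 2)                    # (B, N_coarse, C)
        g, _ = self.global_attn(g, g, g)
        g = g.transpose(1, 2).reshape(B, C, gh, gw)
        g = nn.functional.interpolate(g, size=(H, W), mode="nearest")

        # Fuse the two branches per spatial position.
        out = torch.cat([loc, g], dim=1).permute(0, 2, 3, 1)   # (B, H, W, 2C)
        return self.fuse(out).permute(0, 3, 1, 2)              # (B, C, H, W)


# Usage (spatial size chosen so 28 is divisible by window=7 and pool=4):
block = ParallelLocalGlobalAttention(dim=64, num_heads=4, window=7, pool=4)
y = block(torch.randn(1, 64, 28, 28))   # -> (1, 64, 28, 28)
```

Keeping the two branches parallel rather than stacked lets the windowed branch preserve fine detail while the pooled branch keeps global attention cheap at high input resolutions, which is the kind of property the abstract targets for real-time driving workloads.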
Related papers
- Optimizing Vision Transformers with Data-Free Knowledge Transfer [8.323741354066474]
Vision transformers (ViTs) have excelled in various computer vision tasks due to their superior ability to capture long-distance dependencies.
We propose compressing large ViT models using Knowledge Distillation (KD), implemented in a data-free manner to circumvent limitations related to data availability.
arXiv Detail & Related papers (2024-08-12T07:03:35Z) - CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer Inference [4.523939613157408]
Vision Transformers (ViTs) represent a groundbreaking shift in machine learning approaches to computer vision.
This paper introduces CHOSEN, a software-hardware co-design framework to address these challenges and offer an automated framework for ViT deployment on FPGAs.
CHOSEN achieves 1.5x and 1.42x throughput improvements on the DeiT-S and DeiT-B models, respectively.
arXiv Detail & Related papers (2024-07-17T16:56:06Z) - LPViT: Low-Power Semi-structured Pruning for Vision Transformers [42.91130720962956]
Vision transformers (ViTs) have emerged as a promising alternative to convolutional neural networks for image analysis tasks.
One significant drawback of ViTs is their resource-intensive nature, leading to increased memory footprint, complexity, and power consumption.
We introduce a new block-structured pruning method to address the resource-intensive nature of ViTs, offering a balanced trade-off between accuracy and hardware acceleration.
arXiv Detail & Related papers (2024-07-02T08:58:19Z) - A survey on efficient vision transformers: algorithms, techniques, and
performance benchmarking [19.65897437342896]
Vision Transformer (ViT) architectures are becoming increasingly popular and widely employed to tackle computer vision applications.
This paper mathematically defines the strategies used to make Vision Transformer efficient, describes and discusses state-of-the-art methodologies, and analyzes their performances over different application scenarios.
arXiv Detail & Related papers (2023-09-05T08:21:16Z) - TurboViT: Generating Fast Vision Transformers via Generative
Architecture Search [74.24393546346974]
Vision transformers have shown unprecedented levels of performance in tackling various visual perception tasks in recent years.
There has been significant recent research on the design of efficient vision transformer architectures.
In this study, we explore the generation of fast vision transformer architecture designs via generative architecture search.
arXiv Detail & Related papers (2023-08-22T13:08:29Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z) - SPViT: Enabling Faster Vision Transformers via Soft Token Pruning [38.10083471492964]
Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied in various DNN structures.
We propose a computation-aware soft pruning framework, which can be set up on vanilla Transformers of both flat and CNN-type structures (a generic token-pruning sketch follows this list).
Our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2021-12-27T20:15:25Z) - A Simple Single-Scale Vision Transformer for Object Localization and
Instance Segmentation [79.265315267391]
We propose a simple and compact ViT architecture called Universal Vision Transformer (UViT).
UViT achieves strong performance on object detection and instance segmentation tasks.
arXiv Detail & Related papers (2021-12-17T20:11:56Z) - Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z) - Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks:
specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples (a minimal early-exit sketch follows this list).
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
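As referenced in the SPViT entry above, here is a generic sketch of score-based token pruning for a ViT encoder. The scoring head, fixed keep ratio, and hard top-k selection are illustrative assumptions; SPViT's actual computation-aware soft pruning and token packaging differ in detail.

```python
# Generic sketch: keep the highest-scoring tokens, always retaining the class token.
import torch
import torch.nn as nn


class TokenPruner(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Linear(dim, 1)  # learned per-token importance score

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C); index 0 is assumed to be the class token and is always kept.
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        scores = self.score(patches).squeeze(-1)           # (B, N - 1)
        k = max(1, int(patches.shape[1] * self.keep_ratio))
        keep = scores.topk(k, dim=1).indices.sort(dim=1).values
        idx = keep.unsqueeze(-1).expand(-1, -1, patches.shape[-1])
        kept = patches.gather(1, idx)                      # (B, k, C)
        return torch.cat([cls_tok, kept], dim=1)           # (B, 1 + k, C)


# Usage: prune a DeiT-sized token sequence (1 class token + 196 patches, dim 192).
pruner = TokenPruner(dim=192, keep_ratio=0.7)
out = pruner(torch.randn(2, 197, 192))   # -> (2, 138, 192)
```

Dropping tokens between encoder layers shrinks the attention and MLP cost of every subsequent layer, which is where the computation savings in this line of work come from.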
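For the Multi-Exit Semantic Segmentation entry, a minimal sketch of confidence-thresholded early exiting on a toy CNN follows. The two-stage backbone, classifier-style exit heads, and max-softmax exit criterion are assumptions for illustration; MESS networks attach segmentation heads along the depth of a segmentation backbone and co-optimise the exit policy.

```python
# Minimal sketch: exit early when the first head is already confident (batch size 1).
import torch
import torch.nn as nn


class EarlyExitNet(nn.Module):
    def __init__(self, num_classes: int = 10, threshold: float = 0.9):
        super().__init__()
        self.threshold = threshold
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        # One lightweight head per potential exit point.
        self.exit1 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes))
        self.exit2 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Inference-time policy: take the first exit whose confidence clears the threshold.
        f1 = self.stage1(x)
        logits1 = self.exit1(f1)
        if logits1.softmax(dim=-1).max() >= self.threshold:  # easy sample: stop early
            return logits1
        return self.exit2(self.stage2(f1))                   # hard sample: run full depth


# Usage:
net = EarlyExitNet().eval()
logits = net(torch.randn(1, 3, 64, 64))   # early or late exit, depending on confidence
```

Samples that leave at the first exit skip the cost of all later stages; MESS co-optimises the number, placement, and architecture of such exits for the target device and application.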
This list is automatically generated from the titles and abstracts of the papers on this site.