Compressing Vision Transformers for Low-Resource Visual Learning
- URL: http://arxiv.org/abs/2309.02617v1
- Date: Tue, 5 Sep 2023 23:33:39 GMT
- Title: Compressing Vision Transformers for Low-Resource Visual Learning
- Authors: Eric Youn, Sai Mitheran J, Sanjana Prabhu, Siyuan Chen
- Abstract summary: Vision transformer (ViT) and its variants offer state-of-the-art accuracy in tasks such as image classification, object detection, and semantic segmentation.
However, these models are large and computation-heavy, which limits their deployment in mobile and edge scenarios.
We aim to take a step toward bringing vision transformers to the edge by utilizing popular model compression techniques such as distillation, pruning, and quantization.
- Score: 7.662469543657508
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformer (ViT) and its variants have swept through visual learning
leaderboards and offer state-of-the-art accuracy in tasks such as image
classification, object detection, and semantic segmentation by attending to
different parts of the visual input and capturing long-range spatial
dependencies. However, these models are large and computation-heavy. For
instance, the recently proposed ViT-B model has 86M parameters, making it
impractical for deployment on resource-constrained devices. As a result, their
deployment in mobile and edge scenarios is limited. In our work, we aim to take
a step toward bringing vision transformers to the edge by utilizing popular
model compression techniques such as distillation, pruning, and quantization.
Our chosen application environment is an unmanned aerial vehicle (UAV) that
is battery-powered and memory-constrained, carrying a single-board computer on
the scale of an NVIDIA Jetson Nano with 4GB of RAM. On the other hand, the UAV
requires high accuracy close to that of state-of-the-art ViTs to ensure safe
object avoidance in autonomous navigation or correct localization of humans in
search-and-rescue. Inference latency should also be minimized given the
application requirements. Hence, our target is to enable rapid inference of a
vision transformer on an NVIDIA Jetson Nano (4GB) with minimal accuracy loss.
This allows us to deploy ViTs on resource-constrained devices, opening up new
possibilities in surveillance, environmental monitoring, etc. Our
implementation is made available at https://github.com/chensy7/efficient-vit.
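As a rough illustration of the pipeline the abstract describes, the sketch below distills a ViT-B/16 teacher into a much smaller student ViT, applies magnitude pruning to the student's linear layers, and finishes with post-training dynamic quantization. The student architecture, temperature, loss weighting, and pruning ratio are illustrative assumptions, not the settings used in the paper or its repository.

```python
# Hedged sketch: distill a large ViT teacher into a small student, prune, then quantize.
# The student architecture, temperature, loss weight, and pruning ratio are illustrative
# assumptions, not the configuration used in the paper or its repository.
import torch
import torch.nn.functional as F
import torch.nn.utils.prune as prune
from torchvision.models import vit_b_16, ViT_B_16_Weights
from torchvision.models.vision_transformer import VisionTransformer

# Teacher: the ~86M-parameter ViT-B/16 mentioned in the abstract.
teacher = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1).eval()

# Student: a much smaller ViT (hypothetical depth/width, chosen only for illustration).
student = VisionTransformer(
    image_size=224, patch_size=16,
    num_layers=6, num_heads=3, hidden_dim=192, mlp_dim=768,
    num_classes=1000,
)

optimizer = torch.optim.AdamW(student.parameters(), lr=3e-4)
T, alpha = 4.0, 0.7  # softmax temperature and distillation weight (assumed values)

def distillation_step(images, labels):
    """One response-based KD step: match softened teacher logits plus the ground truth."""
    with torch.no_grad():
        teacher_logits = teacher(images)
    student_logits = student(images)
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce_loss = F.cross_entropy(student_logits, labels)
    loss = alpha * kd_loss + (1.0 - alpha) * ce_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One step on random data as a stand-in for the real training loader.
distillation_step(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,)))

# Magnitude-prune 30% of each linear layer's weights (assumed ratio), then make it permanent.
for module in student.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")

# Post-training dynamic quantization of the remaining linear weights to int8
# before exporting the student to the Jetson-class device.
quantized_student = torch.ao.quantization.quantize_dynamic(
    student.eval(), {torch.nn.Linear}, dtype=torch.qint8
)
```

In practice the pruned, quantized student would then be exported (e.g., via TorchScript or ONNX) for deployment on the Jetson-class device.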
Related papers
- Optimizing Vision Transformers with Data-Free Knowledge Transfer [8.323741354066474]
Vision transformers (ViTs) have excelled in various computer vision tasks due to their superior ability to capture long-distance dependencies.
We propose compressing large ViT models using Knowledge Distillation (KD), implemented in a data-free manner to circumvent limitations related to data availability.
arXiv Detail & Related papers (2024-08-12T07:03:35Z)
- Tiny Multi-Agent DRL for Twins Migration in UAV Metaverses: A Multi-Leader Multi-Follower Stackelberg Game Approach [57.15309977293297]
The synergy between Unmanned Aerial Vehicles (UAVs) and metaverses is giving rise to an emerging paradigm named UAV metaverses.
We propose a tiny machine learning-based Stackelberg game framework based on pruning techniques for efficient UAV-twin (UT) migration in UAV metaverses.
arXiv Detail & Related papers (2024-01-18T02:14:13Z)
- Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT).
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
arXiv Detail & Related papers (2024-01-05T18:59:52Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- Vision Transformers for Mobile Applications: A Short Survey [0.0]
Vision Transformers (ViTs) have demonstrated state-of-the-art performance on many Computer Vision Tasks.
However, deploying large-scale ViTs is resource-intensive and infeasible on many mobile devices.
We look into a few ViTs specifically designed for mobile applications and observe that they either modify the transformer architecture or are built around a combination of CNNs and transformers.
arXiv Detail & Related papers (2023-05-30T19:12:08Z)
- ViTA: A Vision Transformer Inference Accelerator for Edge Applications [4.3469216446051995]
Vision Transformer models, such as ViT, Swin Transformer, and Transformer-in-Transformer, have recently gained significant traction in computer vision tasks.
However, they are compute-heavy and difficult to deploy on resource-constrained edge devices.
We propose ViTA - a hardware accelerator for inference of vision transformer models, targeting resource-constrained edge computing devices.
arXiv Detail & Related papers (2023-02-17T19:35:36Z)
- TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos [57.92385818430939]
Drone-to-drone detection using visual feed has crucial applications, such as detecting drone collisions, detecting drone attacks, or coordinating flight with other drones.
Existing methods are computationally costly, follow non-end-to-end optimization, and have complex multi-stage pipelines, making them less suitable for real-time deployment on edge devices.
We propose a simple yet effective framework, TransVisDrone, that provides an end-to-end solution with higher computational efficiency.
arXiv Detail & Related papers (2022-10-16T03:05:13Z)
- An Extendable, Efficient and Effective Transformer-based Object Detector [95.06044204961009]
We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector.
ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector.
We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
arXiv Detail & Related papers (2022-04-17T09:27:45Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens that are processed in the network as inference proceeds; a minimal sketch of this token-reduction idea appears after this list.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z)
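Following up on the AdaViT entry above, here is a minimal, hypothetical sketch of the token-reduction idea: a small scoring head ranks patch tokens between transformer blocks and only the top-k (plus the class token) are forwarded. The scoring head and keep ratio are assumptions made for illustration; this is not AdaViT's actual halting mechanism or published code.

```python
# Hypothetical token-pruning module illustrating AdaViT-style adaptive token reduction.
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    """Keeps the class token plus the k highest-scoring patch tokens."""
    def __init__(self, dim: int, keep: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-token importance score (assumed design)
        self.keep = keep

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, 1 + num_patches, dim); index 0 is the class token
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        scores = self.score(patches).squeeze(-1)          # (batch, num_patches)
        top_idx = scores.topk(self.keep, dim=1).indices   # indices of kept patches
        idx = top_idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
        kept = patches.gather(1, idx)                     # (batch, keep, dim)
        return torch.cat([cls_tok, kept], dim=1)

# Example: halve 196 patch tokens between two blocks of a hypothetical 192-dim ViT.
pruner = TokenPruner(dim=192, keep=98)
x = torch.randn(2, 1 + 196, 192)
print(pruner(x).shape)  # torch.Size([2, 99, 192])
```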