Related papers: Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies

URL: http://arxiv.org/abs/2503.02891v2
Date: Wed, 30 Apr 2025 13:55:51 GMT
Title: Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies
Authors: Shaibal Saha, Lanyu Xu
Abstract summary: Vision transformers (ViTs) have emerged as powerful and promising techniques for computer vision tasks. High computational complexity and memory demands pose challenges for deployment on resource-constrained edge devices.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, vision transformers (ViTs) have emerged as powerful and promising techniques for computer vision tasks such as image classification, object detection, and segmentation. Unlike convolutional neural networks (CNNs), which rely on hierarchical feature extraction, ViTs treat images as sequences of patches and leverage self-attention mechanisms. However, their high computational complexity and memory demands pose significant challenges for deployment on resource-constrained edge devices. To address these limitations, extensive research has focused on model compression techniques and hardware-aware acceleration strategies. Nonetheless, a comprehensive review that systematically categorizes these techniques and their trade-offs in accuracy, efficiency, and hardware adaptability for edge deployment remains lacking. This survey bridges this gap by providing a structured analysis of model compression techniques, software tools for inference on edge, and hardware acceleration strategies for ViTs. We discuss their impact on accuracy, efficiency, and hardware adaptability, highlighting key challenges and emerging research directions to advance ViT deployment on edge platforms, including graphics processing units (GPUs), application-specific integrated circuits (ASICs), and field-programmable gate arrays (FPGAs). The goal is to inspire further research with a contemporary guide on optimizing ViTs for efficient deployment on edge devices.

Related papers

Deep Learning-based Techniques for Integrated Sensing and Communication Systems: State-of-the-Art, Challenges, and Opportunities [54.12860202362483]
This article comprehensively reviews recent developments and research on deep learning-based (DL-based) techniques for integrated sensing and communication (ISAC) systems. ISAC is regarded as a key enabler for 6G and beyond networks, as many emerging applications, such as vehicular networks and industrial robotics, necessitate both sensing and communication capabilities. As an alternative to conventional techniques, DL-based techniques offer efficient and near-optimal solutions with reduced computational complexity.
arXiv Detail & Related papers (2025-08-23T22:27:51Z)
Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI [26.45869748408205]
Token compression techniques have emerged as powerful tools for Vision Transformer (ViT) inference in computer vision. We present the first systematic taxonomy and comparative study of token compression methods. Our experiments reveal that while token compression methods are effective for general-purpose ViTs, they often underperform when directly applied to compact designs.
arXiv Detail & Related papers (2025-07-13T16:26:05Z)
FCA2: Frame Compression-Aware Autoencoder for Modular and Fast Compressed Video Super-Resolution [68.77813885751308]
State-of-the-art (SOTA) compressed video super-resolution (CVSR) models face persistent challenges, including prolonged inference time, complex training pipelines, and reliance on auxiliary information. We propose an efficient and scalable solution inspired by the structural and statistical similarities between hyperspectral images (HSI) and video data. Our approach introduces a compression-driven dimensionality reduction strategy that reduces computational complexity, accelerates inference, and enhances the extraction of temporal information across frames.
arXiv Detail & Related papers (2025-06-13T07:59:52Z)
Image Recognition with Online Lightweight Vision Transformer: A Survey [53.005965123414576]
This paper surveys various online strategies for generating lightweight vision transformers for image recognition. We evaluate the relevant exploration for each topic on the ImageNet-1K benchmark. We propose future research directions and potential challenges in the lightweighting of vision transformers.
arXiv Detail & Related papers (2025-05-06T02:07:54Z)
On Accelerating Edge AI: Optimizing Resource-Constrained Environments [1.7355861031903428]
Resource-constrained edge deployments demand AI solutions that balance high performance with stringent compute, memory, and energy limitations. We present a comprehensive overview of the primary strategies for accelerating deep learning models under such constraints.
arXiv Detail & Related papers (2025-01-25T01:37:03Z)
Efficient Detection Framework Adaptation for Edge Computing: A Plug-and-play Neural Network Toolbox Enabling Edge Deployment [59.61554561979589]
Edge computing has emerged as a key paradigm for deploying deep learning-based object detection in time-sensitive scenarios. Existing edge detection methods face challenges: difficulty balancing detection precision with lightweight models, limited adaptability, and insufficient real-world validation. We propose the Edge Detection Toolbox (ED-TOOLBOX), which utilizes generalizable plug-and-play components to adapt object detection models for edge environments.
arXiv Detail & Related papers (2024-12-24T07:28:10Z)
Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge. Existing methods struggle to balance high model performance with low resource consumption. We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
Optimizing Vision Transformers with Data-Free Knowledge Transfer [8.323741354066474]
Vision transformers (ViTs) have excelled in various computer vision tasks due to their superior ability to capture long-distance dependencies. We propose compressing large ViT models using Knowledge Distillation (KD), which is implemented data-free to circumvent limitations related to data availability.
arXiv Detail & Related papers (2024-08-12T07:03:35Z)
Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs). This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevailing fundamental backbone networks in the computer vision community. We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm (AdaLog) quantizer.
arXiv Detail & Related papers (2024-07-17T18:38:48Z)
CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer Inference [4.523939613157408]
Vision Transformers (ViTs) represent a groundbreaking shift in machine learning approaches to computer vision. This paper introduces CHOSEN, a software-hardware co-design framework to address these challenges and offer an automated framework for ViT deployment on FPGAs. CHOSEN achieves a 1.5x and 1.42x improvement in throughput on the DeiT-S and DeiT-B models.
arXiv Detail & Related papers (2024-07-17T16:56:06Z)
Model Quantization and Hardware Acceleration for Vision Transformers: A Comprehensive Survey [6.04807281619171]
Vision Transformers (ViTs) have recently garnered considerable attention, emerging as a promising alternative to convolutional neural networks (CNNs) in several vision-related applications. This article provides a comprehensive survey of ViT quantization and its hardware acceleration.
arXiv Detail & Related papers (2024-05-01T04:32:07Z)
A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking [19.65897437342896]
Vision Transformer (ViT) architectures are becoming increasingly popular and widely employed to tackle computer vision applications. This paper mathematically defines the strategies used to make Vision Transformer efficient, describes and discusses state-of-the-art methodologies, and analyzes their performances over different application scenarios.
arXiv Detail & Related papers (2023-09-05T08:21:16Z)
GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer [76.2625311630021]
Vision transformers (ViTs) have shown very impressive empirical performance in various computer vision tasks. To mitigate this challenging problem, structured pruning is a promising solution to compress model size and enable practical efficiency. We propose GOHSP, a unified framework of Graph and Optimization-based Structured Pruning for ViT models.
arXiv Detail & Related papers (2023-01-13T00:40:24Z)
An Extendable, Efficient and Effective Transformer-based Object Detector [95.06044204961009]
We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector. We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
arXiv Detail & Related papers (2022-04-17T09:27:45Z)
Video Coding for Machine: Compact Visual Representation Compression for Intelligent Collaborative Analytics [101.35754364753409]
Video Coding for Machines (VCM) is committed to bridging to an extent separate research tracks of video/image compression and feature compression. This paper summarizes VCM methodology and philosophy based on existing academia and industrial efforts.
arXiv Detail & Related papers (2021-10-18T12:42:13Z)
Towards AIOps in Edge Computing Environments [60.27785717687999]
This paper describes the system design of an AIOps platform which is applicable in heterogeneous, distributed environments. It is feasible to collect metrics with a high frequency and simultaneously run specific anomaly detection algorithms directly on edge devices.
arXiv Detail & Related papers (2021-02-12T09:33:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.