SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution
Vision Transformer
- URL: http://arxiv.org/abs/2303.17605v1
- Date: Thu, 30 Mar 2023 17:59:58 GMT
- Title: SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution
Vision Transformer
- Authors: Xuanyao Chen, Zhijian Liu, Haotian Tang, Li Yi, Hang Zhao, Song Han
- Abstract summary: We introduce SparseViT, which revisits activation sparsity for recent window-based vision transformers (ViTs).
SparseViT achieves speedups of 1.5x, 1.4x, and 1.3x over its dense counterpart in monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation, respectively.
- Score: 42.440822037774645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-resolution images enable neural networks to learn richer visual
representations. However, this improved performance comes at the cost of
growing computational complexity, hindering their usage in latency-sensitive
applications. As not all pixels are equal, skipping computations for
less-important regions offers a simple and effective measure to reduce the
computation. This, however, is hard to translate into actual speedup for
CNNs since it breaks the regularity of the dense convolution workload. In this
paper, we introduce SparseViT that revisits activation sparsity for recent
window-based vision transformers (ViTs). As window attentions are naturally
batched over blocks, actual speedup with window activation pruning becomes
possible: i.e., ~50% latency reduction with 60% sparsity. Different layers
should be assigned different pruning ratios due to their diverse
sensitivities and computational costs. We introduce sparsity-aware adaptation
and apply evolutionary search to efficiently find the optimal layerwise
sparsity configuration within the vast search space. SparseViT achieves
speedups of 1.5x, 1.4x, and 1.3x compared to its dense counterpart in monocular
3D object detection, 2D instance segmentation, and 2D semantic segmentation,
respectively, with negligible to no loss of accuracy.
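To make the window-pruning idea concrete, here is a minimal sketch (not the authors' released code; the function name, tensor layout, and the mean-L2 importance score are illustrative assumptions): each attention window is scored by its activation magnitude, and only the most important (1 - sparsity) fraction is kept for window attention.
```python
# A minimal sketch of window activation pruning, assuming a PyTorch-style
# tensor layout; this is an illustration, not the authors' implementation.
import torch

def prune_windows(windows: torch.Tensor, sparsity: float):
    """Keep the most important (1 - sparsity) fraction of attention windows.

    windows: (num_windows, tokens_per_window, channels)
    """
    num_windows = windows.shape[0]
    num_keep = max(1, int(round(num_windows * (1.0 - sparsity))))
    # Importance score (an assumption here): mean L2 norm of the
    # activations inside each window.
    scores = windows.norm(dim=-1).mean(dim=-1)         # (num_windows,)
    keep_idx = torch.topk(scores, num_keep).indices     # indices of kept windows
    return windows[keep_idx], keep_idx

# Example: 64 windows of 7x7 tokens with 96 channels, pruned at 60% sparsity.
x = torch.randn(64, 49, 96)
kept, idx = prune_windows(x, sparsity=0.6)
print(kept.shape)    # torch.Size([26, 49, 96])
```
Because the surviving windows are still processed as one dense batch, the skipped windows translate into real latency savings, which is why 60% sparsity can yield roughly 50% lower latency.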
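The layerwise sparsity ratios themselves are chosen by evolutionary search; a simplified, mutation-only sketch is shown below (the population size, candidate ratios, and the evaluate interface are assumptions, and the paper's actual search space and adaptation procedure are richer).
```python
# A simplified evolutionary search over layerwise sparsity configurations:
# mutate candidates, score them with a user-supplied evaluator, keep the best.
import random

def evolve_sparsity(num_layers, evaluate, ratios=(0.0, 0.2, 0.4, 0.6, 0.8),
                    population=20, generations=10, mutate_prob=0.2):
    """evaluate(config) -> fitness; config is a list of per-layer sparsity ratios."""
    pop = [[random.choice(ratios) for _ in range(num_layers)]
           for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=evaluate, reverse=True)
        parents = pop[: population // 2]        # keep the fitter half
        children = [[random.choice(ratios) if random.random() < mutate_prob else r
                     for r in parent]
                    for parent in parents]      # mutated copies of the parents
        pop = parents + children
    return max(pop, key=evaluate)

# Toy usage with a dummy evaluator that prefers ~50% average sparsity.
best = evolve_sparsity(8, lambda cfg: -abs(sum(cfg) / len(cfg) - 0.5))
print(best)
```
Here `evaluate` stands in for whatever fitness the search optimizes; in the paper's setting it would rank configurations of the sparsity-aware adapted model.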
Related papers
- MaskVD: Region Masking for Efficient Video Object Detection [11.759503235646696]
Video tasks are compute-heavy and pose a challenge when deploying in real-time applications.
This paper presents a strategy for masking regions in video frames.
By leveraging extracted features from previous frames, ViT backbones directly benefit from region masking.
arXiv Detail & Related papers (2024-07-16T08:01:49Z)
- ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention [33.00435765051738]
We introduce Gated Linear Attention (GLA) for vision, leveraging its superior hardware-awareness and efficiency.
Our proposed model, ViG, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks.
ViG-T uses 5.2$\times$ fewer FLOPs, saves 90% GPU memory, runs 4.8$\times$ faster, and achieves 20.7% higher top-1 accuracy than DeiT-T.
arXiv Detail & Related papers (2024-05-28T17:59:21Z)
- Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed [42.861344584752]
Previous detector-free matcher LoFTR has shown remarkable matching capability in handling large-viewpoint change and texture-poor scenarios.
We revisit its design choices and derive multiple improvements for both efficiency and accuracy.
Our method can achieve higher accuracy compared with competitive semi-dense matchers.
arXiv Detail & Related papers (2024-03-07T18:58:40Z)
- AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z)
- Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms: spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
arXiv Detail & Related papers (2023-08-30T10:57:41Z)
- SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications [98.90623605283564]
We introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications.
We build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed.
Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster compared to MobileViT-v2.
arXiv Detail & Related papers (2023-03-27T17:59:58Z)
- Efficient Latency-Aware CNN Depth Compression via Two-Stage Dynamic Programming [15.458305667190256]
We propose a novel depth compression algorithm which targets general convolution operations.
We achieve a $1.41\times$ speed-up with 0.11%p accuracy gain for MobileNetV2-1.0 on ImageNet.
arXiv Detail & Related papers (2023-01-28T13:08:54Z)
- TorchSparse: Efficient Point Cloud Inference Engine [24.541195361633523]
We introduce TorchSparse, a high-performance point cloud inference engine.
TorchSparse directly optimizes the two bottlenecks of sparse convolution: irregular computation and data movement.
It achieves 1.6x and 1.5x measured end-to-end speedup over the state-of-the-art MinkowskiEngine and SpConv, respectively.
arXiv Detail & Related papers (2022-04-21T17:58:30Z)
- SALISA: Saliency-based Input Sampling for Efficient Video Object Detection [58.22508131162269]
We propose SALISA, a novel non-uniform SALiency-based Input SAmpling technique for video object detection.
We show that SALISA significantly improves the detection of small objects.
arXiv Detail & Related papers (2022-04-05T17:59:51Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often contain a large number of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)