EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction
- URL: http://arxiv.org/abs/2205.14756v6
- Date: Tue, 6 Feb 2024 02:57:35 GMT
- Title: EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction
- Authors: Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han
- Abstract summary: This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves a global receptive field and multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
- Score: 67.11722682878722
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-resolution dense prediction enables many appealing real-world
applications, such as computational photography, autonomous driving, etc.
However, the vast computational cost makes deploying state-of-the-art
high-resolution dense prediction models on hardware devices difficult. This
work presents EfficientViT, a new family of high-resolution vision models with
novel multi-scale linear attention. Unlike prior high-resolution dense
prediction models that rely on heavy softmax attention, hardware-inefficient
large-kernel convolution, or complicated topology structure to obtain good
performances, our multi-scale linear attention achieves the global receptive
field and multi-scale learning (two desirable features for high-resolution
dense prediction) with only lightweight and hardware-efficient operations. As
such, EfficientViT delivers remarkable performance gains over previous
state-of-the-art models with significant speedup on diverse hardware platforms,
including mobile CPU, edge GPU, and cloud GPU. Without performance loss on
Cityscapes, our EfficientViT provides up to 13.9$\times$ and 6.2$\times$ GPU
latency reduction over SegFormer and SegNeXt, respectively. For
super-resolution, EfficientViT delivers up to 6.4$\times$ speedup over Restormer while
providing a 0.11 dB gain in PSNR. For Segment Anything, EfficientViT delivers
48.9$\times$ higher throughput on A100 GPU while achieving slightly better zero-shot
instance segmentation performance on COCO.
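The operation behind these numbers is linear attention: the softmax kernel is replaced with a simple non-linearity (ReLU in EfficientViT) so that global token interaction can be computed via matrix-multiplication associativity in linear rather than quadratic time. The snippet below is a hedged, illustrative PyTorch sketch of that trick only, assuming (batch, tokens, dim) inputs; it omits the multi-scale token aggregation and the exact EfficientViT module layout described in the paper.

```python
import torch
import torch.nn.functional as F

def relu_linear_attention(q, k, v, eps=1e-6):
    """Illustrative ReLU linear attention (not the authors' exact module).

    q, k, v: (batch, tokens, dim). Global interaction in O(tokens) time by
    computing K^T V (a dim x dim matrix) instead of the tokens x tokens map.
    """
    q, k = F.relu(q), F.relu(k)
    kv = torch.einsum("bnd,bne->bde", k, v)           # K^T V: (B, dim, dim)
    out = torch.einsum("bnd,bde->bne", q, kv)         # Q (K^T V): (B, N, dim)
    z = torch.einsum("bnd,bd->bn", q, k.sum(dim=1))   # normalizer Q (K^T 1)
    return out / (z.unsqueeze(-1) + eps)
```

Because K^T V is only dim x dim, the cost grows linearly with the number of tokens, which is what makes a global receptive field affordable at high input resolution; the multi-scale behavior described in the abstract comes from additionally aggregating nearby Q/K/V tokens with small, hardware-friendly convolutions before this step (not shown here).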
Related papers
- Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs).
This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
- MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems [6.8519529064678375]
Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructures, and incurs high operational costs.
To minimize the outstanding communication latency and other inherent at-scale inefficiencies, we introduce an agile performance modeling framework, MAD-Max.
This framework is designed to optimize parallelization strategies and facilitate hardware-software co-design opportunities.
arXiv Detail & Related papers (2023-10-04T13:00:53Z)
- EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention [44.148667664413004]
We propose a family of high-speed vision transformers named EfficientViT.
We find that the speed of existing transformer models is commonly bounded by memory inefficient operations.
To address this, we present a cascaded group attention module that feeds attention heads with different splits of the full feature.
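As a rough illustration of that idea, the following hedged PyTorch sketch (not the authors' implementation; the per-head projections, head count, and the absence of the paper's conv-based token mixing are all assumptions) gives each head its own channel split and passes each head's output forward to refine the next head's input.

```python
import torch
import torch.nn as nn

class CascadedGroupAttentionSketch(nn.Module):
    """Hedged sketch of cascaded group attention: heads see different channel
    splits of the input, and each head's output is added to the next head's
    input so later heads refine earlier ones."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        # One q/k/v projection per head, acting only on that head's split.
        self.qkv = nn.ModuleList(
            [nn.Linear(self.head_dim, 3 * self.head_dim) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                    # x: (B, tokens, dim)
        splits = x.chunk(self.num_heads, dim=-1)
        outs, carry = [], 0
        for split, blk in zip(splits, self.qkv):
            q, k, v = blk(split + carry).chunk(3, dim=-1)    # cascade input
            attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
            carry = attn.softmax(dim=-1) @ v
            outs.append(carry)
        return self.proj(torch.cat(outs, dim=-1))
```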
arXiv Detail & Related papers (2023-05-11T17:59:41Z)
- InceptionNeXt: When Inception Meets ConvNeXt [167.61042926444105]
We build a series of networks, namely InceptionNeXt, which not only enjoy high throughput but also maintain competitive performance.
InceptionNeXt achieves 1.6x higher training throughput than ConvNeXt-T, as well as a 0.2% top-1 accuracy improvement on ImageNet-1K.
arXiv Detail & Related papers (2023-03-29T17:59:58Z)
- Efficient Large-scale Scene Representation with a Hybrid of High-resolution Grid and Plane Features [44.25307397334988]
Existing neural radiance fields (NeRF) methods for large-scale scene modeling require days of training using multiple GPUs.
We introduce a new and efficient hybrid feature representation for NeRF that fuses the 3D hash-grids and high-resolution 2D dense plane features.
Based on this hybrid representation, we propose a fast optimization NeRF variant, called GP-NeRF, that achieves better rendering results while maintaining a compact model size.
arXiv Detail & Related papers (2023-03-06T10:04:50Z)
- Data-Model-Circuit Tri-Design for Ultra-Light Video Intelligence on Edge Devices [90.30316433184414]
We propose a data-model-hardware tri-design framework for high-throughput, low-cost, and high-accuracy MOT on HD video streams.
Compared to the state-of-the-art MOT baseline, our tri-design approach can achieve 12.5x latency reduction, 20.9x effective frame rate improvement, 5.83x lower power, and 9.78x better energy efficiency, without much accuracy drop.
arXiv Detail & Related papers (2022-10-16T16:21:40Z)
- Efficient Heterogeneous Video Segmentation at the Edge [2.4378845585726903]
We introduce an efficient video segmentation system for resource-limited edge devices leveraging heterogeneous compute.
Specifically, we design network models by searching across multiple dimensions of specifications for the neural architectures.
We analyze and optimize the heterogeneous data flows in our systems across the CPU, the GPU and the NPU.
arXiv Detail & Related papers (2022-08-24T17:01:09Z)
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale [20.558091867632445]
DeepSpeed Inference is a comprehensive system solution for transformer model inference.
It reduces latency by up to 7.3x over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios.
It can run inference on 25x larger models than GPU-only solutions, while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).
arXiv Detail & Related papers (2022-06-30T18:01:08Z)
- Revisiting Multi-Scale Feature Fusion for Semantic Segmentation [90.32746095413447]
In this paper, we demonstrate that neither high internal resolution nor atrous convolutions are necessary for accurate semantic segmentation.
We develop a simplified segmentation model, named ESeg, which has neither high internal resolution nor expensive atrous convolutions.
Our simple method can achieve better accuracy with faster speed than prior art across multiple datasets.
arXiv Detail & Related papers (2022-03-23T19:14:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.