FastViT: A Fast Hybrid Vision Transformer using Structural
Reparameterization
- URL: http://arxiv.org/abs/2303.14189v2
- Date: Thu, 17 Aug 2023 21:10:59 GMT
- Title: FastViT: A Fast Hybrid Vision Transformer using Structural
Reparameterization
- Authors: Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel,
Anurag Ranjan
- Abstract summary: We introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off.
We show that our model is 3.5x faster than CMT, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset.
- Score: 14.707312504365376
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent amalgamation of transformer and convolutional designs has led to
steady improvements in accuracy and efficiency of the models. In this work, we
introduce FastViT, a hybrid vision transformer architecture that obtains the
state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel
token mixing operator, RepMixer, a building block of FastViT, that uses
structural reparameterization to lower the memory access cost by removing
skip-connections in the network. We further apply train-time
overparametrization and large kernel convolutions to boost accuracy and
empirically show that these choices have minimal effect on latency. We show
that our model is 3.5x faster than CMT, a recent state-of-the-art hybrid
transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than
ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset. At
similar latency, our model obtains 4.2% better Top-1 accuracy on ImageNet than
MobileOne. Our model consistently outperforms competing architectures across
several tasks -- image classification, detection, segmentation and 3D mesh
regression with significant improvement in latency on both a mobile device and
a desktop GPU. Furthermore, our model is highly robust to out-of-distribution
samples and corruptions, improving over competing robust models. Code and
models are available at https://github.com/apple/ml-fastvit.
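The snippet below is a minimal, illustrative sketch of the structural-reparameterization idea behind RepMixer, assuming a simplified block of the form x + depthwise_conv(x). The names (e.g. RepMixerSketch) are made up for illustration; this is not the released apple/ml-fastvit code. At train time the block keeps the skip connection, and at inference the identity branch is folded into the depthwise kernel, so the skip connection and its memory traffic disappear.

```python
import torch
import torch.nn as nn

class RepMixerSketch(nn.Module):
    """Simplified residual depthwise-conv token mixer with reparameterization."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.dim, self.k = dim, kernel_size
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Train-time form: identity skip connection + depthwise token mixing.
        return x + self.dwconv(x)

    @torch.no_grad()
    def reparameterize(self) -> nn.Conv2d:
        # Inference-time form: fold the identity branch into the conv kernel,
        # leaving a single depthwise conv and no skip connection.
        fused = nn.Conv2d(self.dim, self.dim, self.k,
                          padding=self.k // 2, groups=self.dim)
        identity = torch.zeros_like(self.dwconv.weight)    # (dim, 1, k, k)
        identity[:, 0, self.k // 2, self.k // 2] = 1.0     # per-channel identity kernel
        fused.weight.copy_(self.dwconv.weight + identity)
        fused.bias.copy_(self.dwconv.bias)
        return fused

# The fused conv reproduces the residual block exactly.
block = RepMixerSketch(dim=8).eval()
x = torch.randn(1, 8, 32, 32)
assert torch.allclose(block(x), block.reparameterize()(x), atol=1e-5)
```

The paper additionally uses BatchNorm and train-time overparameterized branches; these are folded into the convolution weights in the same spirit before deployment and are omitted here for brevity.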
Related papers
- SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design [5.962184741057505]
This paper aims to address computational redundancy at all design levels in a memory-efficient manner.
We discover that using larger-stride patchify stem not only reduces memory access costs but also achieves competitive performance.
We introduce SHViT, a Single-Head Vision Transformer that obtains the state-of-the-art speed-accuracy tradeoff.
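As a rough illustration of the larger-stride patchify stem mentioned above: a single stride-16 convolution turns a 224x224 image into a 14x14 token grid, far fewer tokens than a stride-4 stem, which is where the memory-access savings come from. The layer and channel count below are illustrative assumptions, not the SHViT reference code.

```python
import torch
import torch.nn as nn

# Hypothetical stride-16 patchify stem: one token per 16x16 patch.
stem = nn.Conv2d(in_channels=3, out_channels=128, kernel_size=16, stride=16)
tokens = stem(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 128, 14, 14]) -> 196 tokens vs 3136 for a stride-4 stem
```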
arXiv Detail & Related papers (2024-01-29T09:12:23Z) - FMViT: A multiple-frequency mixing Vision Transformer [17.609263967586926]
We propose an efficient hybrid ViT architecture named FMViT.
This approach blends high-frequency features and low-frequency features with varying frequencies, enabling it to capture both local and global information effectively.
We demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of latency/accuracy trade-offs for various vision tasks.
arXiv Detail & Related papers (2023-11-09T19:33:50Z) - EfficientViT: Memory Efficient Vision Transformer with Cascaded Group
Attention [44.148667664413004]
We propose a family of high-speed vision transformers named EfficientViT.
We find that the speed of existing transformer models is commonly bounded by memory inefficient operations.
To address this, we present a cascaded group attention module feeding attention heads with different splits.
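Below is a hedged sketch of the cascaded-group-attention idea described above: each head attends over its own channel split of the input, and each head's output is added to the next head's split before attention, so later heads see progressively refined features. This illustrates the concept only; names and dimensions are assumptions, not the official EfficientViT module.

```python
import torch
import torch.nn as nn

class CascadedGroupAttentionSketch(nn.Module):
    """Illustrative cascaded group attention: one channel split per head."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Each head has its own projections and only sees its channel split.
        self.qkv = nn.ModuleList(
            [nn.Linear(self.head_dim, 3 * self.head_dim) for _ in range(num_heads)])
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, N, dim)
        splits = x.chunk(self.num_heads, dim=-1)
        outs, carry = [], 0
        for i, head in enumerate(self.qkv):
            feat = splits[i] + carry                       # cascade previous head's output
            q, k, v = head(feat).chunk(3, dim=-1)
            attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
            out = attn.softmax(dim=-1) @ v
            outs.append(out)
            carry = out
        return self.proj(torch.cat(outs, dim=-1))

y = CascadedGroupAttentionSketch(dim=64)(torch.randn(2, 196, 64))  # (2, 196, 64)
```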
arXiv Detail & Related papers (2023-05-11T17:59:41Z) - SwiftFormer: Efficient Additive Attention for Transformer-based
Real-time Mobile Vision Applications [98.90623605283564]
We introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications.
We build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed.
Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster compared to MobileViT-v2.
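The sketch below shows a generic linear "additive attention" in the spirit described above: per-token scores pool the queries into a single global query, which then interacts with the keys through element-wise multiplication, avoiding the N x N attention matrix. It is an assumption-laden illustration of the idea, not the exact SwiftFormer formulation.

```python
import torch
import torch.nn as nn

class AdditiveAttentionSketch(nn.Module):
    """Linear-complexity attention via a pooled global query (illustrative)."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)      # learned per-token importance
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (B, N, dim)
        q, k = self.to_q(x), self.to_k(x)
        alpha = (self.score(q) * self.scale).softmax(dim=1)     # (B, N, 1) over tokens
        global_q = (alpha * q).sum(dim=1, keepdim=True)         # (B, 1, dim)
        out = global_q * k                  # element-wise interaction, O(N * dim)
        return self.proj(out) + q           # light residual mixing

y = AdditiveAttentionSketch(dim=64)(torch.randn(2, 196, 64))    # (2, 196, 64)
```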
arXiv Detail & Related papers (2023-03-27T17:59:58Z) - Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z) - EfficientFormer: Vision Transformers at MobileNet Speed [43.93223983817965]
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks.
ViT-based models are generally several times slower than lightweight convolutional networks.
Recent efforts try to reduce the complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory.
arXiv Detail & Related papers (2022-06-02T17:51:03Z) - EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision
Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z) - EfficientNetV2: Smaller Models and Faster Training [91.77432224225221]
This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models.
We use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency.
Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller.
arXiv Detail & Related papers (2021-04-01T07:08:36Z) - Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with
Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales.
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
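To make the scaling argument concrete: with dual encoders, gallery image embeddings can be precomputed once, so ranking a text query against the whole gallery is a single matrix multiply, whereas a cross-attention model must re-run the transformer for every query-image pair. The tiny stand-in encoders below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in encoders mapping text / image features into a shared 128-d space.
text_encoder = nn.Sequential(nn.Linear(300, 256), nn.ReLU(), nn.Linear(256, 128))
image_encoder = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 128))

with torch.no_grad():
    # Gallery embeddings are computed once, offline.
    gallery = F.normalize(image_encoder(torch.randn(10_000, 2048)), dim=-1)
    # At query time, ranking the whole gallery is one matrix multiply.
    query = F.normalize(text_encoder(torch.randn(1, 300)), dim=-1)
    scores = query @ gallery.T            # (1, 10000) cosine similarities
    top5 = scores.topk(5).indices
    print(top5)
```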
arXiv Detail & Related papers (2021-03-30T17:57:08Z) - Making DensePose fast and light [78.49552144907513]
Existing neural network models capable of solving this task are heavily parameterized.
To enable Dense Pose inference on the end device with current models, one needs to support an expensive server-side infrastructure and have a stable internet connection.
In this work, we target the problem of redesigning the DensePose R-CNN model's architecture so that the final network retains most of its accuracy but becomes more light-weight and fast.
arXiv Detail & Related papers (2020-06-26T19:42:20Z)