FMViT: A multiple-frequency mixing Vision Transformer
- URL: http://arxiv.org/abs/2311.05707v1
- Date: Thu, 9 Nov 2023 19:33:50 GMT
- Title: FMViT: A multiple-frequency mixing Vision Transformer
- Authors: Wei Tan, Yifeng Geng, Xuansong Xie
- Abstract summary: We propose an efficient hybrid ViT architecture named FMViT.
This approach blends high-frequency features and low-frequency features with varying frequencies, enabling it to capture both local and global information effectively.
We demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of latency/accuracy trade-offs for various vision tasks.
- Score: 17.609263967586926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The transformer model has gained widespread adoption in computer vision tasks
in recent times. However, because the time and memory complexity of
self-attention grows quadratically with the number of input tokens, most
existing Vision Transformers (ViTs) encounter challenges in achieving efficient
performance in practical industrial deployment scenarios, such as TensorRT and
CoreML, where traditional CNNs excel. Although some recent attempts have been
made to design CNN-Transformer hybrid architectures to tackle this problem,
their overall performance has not met expectations. To address these challenges,
we propose an efficient hybrid ViT architecture named FMViT. This approach
enhances the model's expressive power by blending high-frequency features and
low-frequency features with varying frequencies, enabling it to capture both
local and global information effectively. Additionally, we introduce
deploy-friendly mechanisms such as Convolutional Multigroup Reparameterization
(gMLP), Lightweight Multi-head Self-Attention (RLMHSA), and Convolutional
Fusion Block (CFB) to further improve the model's performance and reduce
computational overhead. Our experiments demonstrate that FMViT surpasses
existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of
latency/accuracy trade-offs for various vision tasks. On the TensorRT platform,
FMViT outperforms ResNet101 by 2.5% (83.3% vs. 80.8%) in top-1 accuracy on the
ImageNet dataset while maintaining similar inference latency. Moreover, FMViT
achieves comparable performance with EfficientNet-B5, but with a 43%
improvement in inference speed. On CoreML, FMViT outperforms MobileOne by 2.6%
in top-1 accuracy on the ImageNet dataset (78.5% vs. 75.9%), with inference
latency comparable to MobileOne. Our code can be found at
https://github.com/tany0699/FMViT.
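To make the frequency-mixing idea in the abstract concrete, below is a minimal PyTorch-style sketch of a block that blends a high-frequency branch (a local depthwise convolution) with a low-frequency branch (self-attention over a downsampled feature map). The class name FrequencyMixingBlock, its parameters, and its structure are illustrative assumptions only and are not taken from the FMViT paper or the code at https://github.com/tany0699/FMViT.

```python
# Illustrative sketch only: mixing a high-frequency (local, convolutional)
# branch with a low-frequency (downsampled, attention-based) branch.
# Names and structure are hypothetical, not the official FMViT implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrequencyMixingBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, pool_size: int = 2):
        super().__init__()
        # High-frequency branch: depthwise conv keeps fine local detail.
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        # Low-frequency branch: self-attention over a pooled (downsampled)
        # feature map captures global context at a reduced token count.
        self.pool = nn.AvgPool2d(pool_size)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # 1x1 conv fuses the concatenated branches back to `dim` channels.
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        hi = self.local(x)                                   # (B, C, H, W)

        lo = self.pool(x)                                    # (B, C, H/p, W/p)
        tokens = lo.flatten(2).transpose(1, 2)               # (B, N, C)
        tokens = self.norm(tokens)
        tokens, _ = self.attn(tokens, tokens, tokens)
        lo = tokens.transpose(1, 2).reshape(b, c, *lo.shape[2:])
        lo = F.interpolate(lo, size=(h, w), mode="nearest")  # back to (B, C, H, W)

        return self.fuse(torch.cat([hi, lo], dim=1)) + x     # residual fusion


if __name__ == "__main__":
    block = FrequencyMixingBlock(dim=64)
    out = block(torch.randn(1, 64, 14, 14))
    print(out.shape)  # torch.Size([1, 64, 14, 14])
```

The design choice sketched here is only the general hybrid pattern the abstract describes: the convolutional branch models local (high-frequency) detail cheaply, while attention on a downsampled map provides global (low-frequency) context without quadratic cost at full resolution.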
Related papers
- EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention [44.148667664413004]
We propose a family of high-speed vision transformers named EfficientViT.
We find that the speed of existing transformer models is commonly bounded by memory inefficient operations.
To address this, we present a cascaded group attention module that feeds attention heads with different splits of the full feature.
arXiv Detail & Related papers (2023-05-11T17:59:41Z)
- FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization [14.707312504365376]
We introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off.
We show that our model is 3.5x faster than CMT, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2023-03-24T17:58:32Z)
- Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z)
- Lightweight Vision Transformer with Cross Feature Attention [6.103065659061625]
Convolutional neural networks (CNNs) exploit spatial inductive bias to learn visual representations.
ViTs can learn global representations with their self-attention mechanism, but they are usually heavy-weight and unsuitable for mobile devices.
We propose cross feature attention (XFA) to bring down the computational cost of transformers, and combine it with efficient mobile CNNs to form a novel lightweight CNN-ViT hybrid model, XFormer.
arXiv Detail & Related papers (2022-07-15T03:27:13Z)
- Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios [19.94294348122248]
Most vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios.
We propose a next generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT.
Next-ViT dominates both CNNs and ViTs from the perspective of latency/accuracy trade-off.
arXiv Detail & Related papers (2022-07-12T12:50:34Z)
- EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [68.35683849098105]
We introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-06-21T17:59:56Z)
- EfficientFormer: Vision Transformers at MobileNet Speed [43.93223983817965]
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks.
ViT-based models are generally several times slower than lightweight convolutional networks.
Recent efforts try to reduce the complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory.
arXiv Detail & Related papers (2022-06-02T17:51:03Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- CMT: Convolutional Neural Networks Meet Vision Transformers [68.10025999594883]
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image.
There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs).
We propose a new transformer based hybrid network by taking advantage of transformers to capture long-range dependencies, and of CNNs to model local features.
In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller on FLOPs than the existing DeiT and EfficientNet, respectively.
arXiv Detail & Related papers (2021-07-13T17:47:19Z)
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)