LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference
- URL: http://arxiv.org/abs/2104.01136v1
- Date: Fri, 2 Apr 2021 16:29:57 GMT
- Title: LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference
- Authors: Ben Graham and Alaaeldin El-Nouby and Hugo Touvron and Pierre Stock
and Armand Joulin and Hervé Jégou and Matthijs Douze
- Abstract summary: We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime.
We introduce the attention bias, a new way to integrate positional information in vision transformers.
Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff.
- Score: 25.63398340113755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We design a family of image classification architectures that optimize the
trade-off between accuracy and efficiency in a high-speed regime. Our work
exploits recent findings in attention-based architectures, which are
competitive on highly parallel processing hardware. We re-evaluate principles
from the extensive literature on convolutional neural networks to apply them to
transformers, in particular activation maps with decreasing resolutions. We
also introduce the attention bias, a new way to integrate positional
information in vision transformers. As a result, we propose LeViT: a hybrid
neural network for fast inference image classification. We consider different
measures of efficiency on different hardware platforms, so as to best reflect a
wide range of application scenarios. Our extensive experiments empirically
validate our technical choices and show they are suitable to most
architectures. Overall, LeViT significantly outperforms existing convnets and
vision transformers with respect to the speed/accuracy tradeoff. For example,
at 80% ImageNet top-1 accuracy, LeViT is 3.3 times faster than EfficientNet on
the CPU.
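
The attention bias mentioned in the abstract replaces explicit positional embeddings with a learned, per-head bias that is added to the attention logits according to the relative offset between query and key positions. The snippet below is a minimal PyTorch sketch of that idea; the module name, head count, and 14x14 grid size are illustrative assumptions rather than the paper's exact implementation.

    # Hedged sketch: per-head attention bias indexed by relative (dx, dy) offsets,
    # added to the attention logits in place of explicit positional embeddings.
    # Names and hyperparameters are illustrative, not LeViT's exact code.
    import torch
    import torch.nn as nn

    class AttentionWithBias(nn.Module):
        def __init__(self, dim=128, num_heads=4, resolution=14):
            super().__init__()
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.scale = self.head_dim ** -0.5
            self.qkv = nn.Linear(dim, dim * 3)
            self.proj = nn.Linear(dim, dim)

            # One learnable bias per head and per distinct relative offset.
            points = [(x, y) for x in range(resolution) for y in range(resolution)]
            offsets, idxs = {}, []
            for p in points:
                for q in points:
                    off = (abs(p[0] - q[0]), abs(p[1] - q[1]))
                    if off not in offsets:
                        offsets[off] = len(offsets)
                    idxs.append(offsets[off])
            self.attention_biases = nn.Parameter(torch.zeros(num_heads, len(offsets)))
            self.register_buffer(
                "bias_idxs",
                torch.tensor(idxs, dtype=torch.long).view(len(points), len(points)),
            )

        def forward(self, x):  # x: (B, N, C) with N = resolution**2
            B, N, C = x.shape
            qkv = self.qkv(x).view(B, N, 3, self.num_heads, self.head_dim)
            q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each (B, heads, N, head_dim)
            attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N)
            attn = attn + self.attention_biases[:, self.bias_idxs]  # add positional bias
            attn = attn.softmax(dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(B, N, C)
            return self.proj(out)

Calling AttentionWithBias()(torch.randn(2, 14 * 14, 128)) returns a tensor of the same shape.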
Related papers
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- FasterViT: Fast Vision Transformers with Hierarchical Attention [63.50580266223651]
We design a new family of hybrid CNN-ViT neural networks, named FasterViT, with a focus on high image throughput for computer vision (CV) applications.
Our newly introduced Hierarchical Attention (HAT) approach decomposes global self-attention with quadratic complexity into a multi-level attention with reduced computational costs.
arXiv Detail & Related papers (2023-06-09T18:41:37Z)
- FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization [14.707312504365376]
We introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off.
We show that our model is 3.5x faster than CMT, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2023-03-24T17:58:32Z)
- Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z)
- Fast-ParC: Capturing Position Aware Global Feature for ConvNets and ViTs [35.39701561076837]
We propose a new basic neural network operator named position-aware circular convolution (ParC) and its accelerated version Fast-ParC.
Our Fast-ParC further reduces the O(n^2) time complexity of ParC to O(n log n) using the Fast Fourier Transform (see the sketch after this entry).
Experimental results show that our ParC op can effectively enlarge the receptive field of traditional ConvNets.
arXiv Detail & Related papers (2022-10-08T13:14:02Z)
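
As a rough illustration of the O(n log n) claim in the Fast-ParC summary above, the sketch below computes a 1-D circular convolution both directly and via the FFT using the convolution theorem. It is a generic NumPy example with made-up function names, not the paper's ParC operator itself, which is position-aware and learned.

    # Hedged sketch: circular convolution via the FFT in O(n log n),
    # illustrating the speed-up claimed over the direct O(n^2) form.
    # Generic NumPy example, not the paper's exact operator.
    import numpy as np

    def circular_conv_direct(x, w):
        """Direct circular convolution: O(n^2)."""
        n = len(x)
        return np.array([sum(x[(i - j) % n] * w[j] for j in range(n)) for i in range(n)])

    def circular_conv_fft(x, w):
        """Same result via the convolution theorem: O(n log n)."""
        return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(w)))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        x, w = rng.standard_normal(256), rng.standard_normal(256)
        assert np.allclose(circular_conv_direct(x, w), circular_conv_fft(x, w))

For a length-n input the FFT route costs three length-n transforms plus an elementwise product, which is where the O(n log n) complexity comes from.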
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- Three things everyone should know about Vision Transformers [67.30250766591405]
Transformer architectures have rapidly gained traction in computer vision.
We offer three insights based on simple and easy-to-implement variants of vision transformers.
We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set.
arXiv Detail & Related papers (2022-03-18T08:23:03Z)
- ConvNets vs. Transformers: Whose Visual Representations are More Transferable? [49.62201738334348]
We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
arXiv Detail & Related papers (2021-08-11T16:20:38Z)
- Rethinking the Design Principles of Robust Vision Transformer [28.538786330184642]
Vision Transformers (ViTs) have shown that self-attention-based networks surpass traditional convolutional neural networks (CNNs) in most vision tasks.
In this paper, we rethink the design principles of ViTs based on the robustness.
By combining the robust design components, we propose the Robust Vision Transformer (RVT).
arXiv Detail & Related papers (2021-05-17T15:04:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.