Next-ViT: Next Generation Vision Transformer for Efficient Deployment in
Realistic Industrial Scenarios
- URL: http://arxiv.org/abs/2207.05501v2
- Date: Wed, 13 Jul 2022 08:59:42 GMT
- Title: Next-ViT: Next Generation Vision Transformer for Efficient Deployment in
Realistic Industrial Scenarios
- Authors: Jiashi Li, Xin Xia, Wei Li, Huixia Li, Xing Wang, Xuefeng Xiao, Rui
Wang, Min Zheng, Xin Pan
- Abstract summary: Most vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios.
We propose a next generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT.
Next-ViT dominates both CNNs and ViTs from the perspective of latency/accuracy trade-off.
- Score: 19.94294348122248
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to their complex attention mechanisms and model design, most existing
vision Transformers (ViTs) cannot perform as efficiently as convolutional
neural networks (CNNs) in realistic industrial deployment scenarios, e.g.,
TensorRT and CoreML. This poses a distinct challenge: can a visual neural
network be designed to infer as fast as CNNs and perform as powerfully as ViTs?
Recent works have tried to design CNN-Transformer hybrid architectures to
address this issue, yet the overall performance of these works is far from
satisfactory. To this end, we propose a next generation vision Transformer for
efficient deployment in realistic industrial scenarios, namely Next-ViT, which
dominates both CNNs and ViTs from the perspective of latency/accuracy
trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer
Block (NTB) are respectively developed to capture local and global information
with deployment-friendly mechanisms. Then, the Next Hybrid Strategy (NHS) is
designed to stack NCBs and NTBs in an efficient hybrid paradigm, which boosts
performance on various downstream tasks. Extensive experiments show that
Next-ViT significantly outperforms existing CNNs, ViTs and CNN-Transformer
hybrid architectures with respect to the latency/accuracy trade-off across
various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.4 mAP (from
40.4 to 45.8) on COCO detection and 8.2% mIoU (from 38.8% to 47.0%) on ADE20K
segmentation under similar latency. Meanwhile, it achieves comparable
performance with CSWin, while the inference speed is accelerated by 3.6x. On
CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on
COCO detection and 3.5% mIoU (from 45.2% to 48.7%) on ADE20K segmentation under
similar latency. The code will be released soon.
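To make the hybrid design described above more concrete, here is a minimal PyTorch sketch of one stage that stacks NCB-style convolution blocks followed by an NTB-style Transformer block. The block internals below (a depthwise-separable convolution for the NCB stand-in, plain multi-head self-attention plus an MLP for the NTB stand-in) and the "several NCBs then one NTB per stage" pattern are simplifying assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of a hybrid convolution/Transformer stage in the spirit of Next-ViT.
# NCB/NTB internals here are simplified stand-ins, not the paper's exact blocks.
import torch
import torch.nn as nn


class NCBSketch(nn.Module):
    """Convolution block for local information (simplified stand-in for NCB)."""

    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # depthwise 3x3
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, dim, kernel_size=1),                         # pointwise 1x1
            nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.conv(x)  # residual connection


class NTBSketch(nn.Module):
    """Transformer block for global information (simplified stand-in for NTB)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        q = self.norm1(tokens)
        tokens = tokens + self.attn(q, q, q)[0]          # global self-attention
        tokens = tokens + self.mlp(self.norm2(tokens))   # feed-forward network
        return tokens.transpose(1, 2).reshape(b, c, h, w)


def hybrid_stage(dim: int, num_conv_blocks: int = 3) -> nn.Sequential:
    """One stage stacking several convolution blocks followed by one Transformer block."""
    return nn.Sequential(*[NCBSketch(dim) for _ in range(num_conv_blocks)], NTBSketch(dim))


if __name__ == "__main__":
    stage = hybrid_stage(dim=64, num_conv_blocks=3)
    out = stage(torch.randn(1, 64, 14, 14))
    print(out.shape)  # torch.Size([1, 64, 14, 14])
```

The actual NCB and NTB are engineered around TensorRT/CoreML-friendly operators; the sketch above only conveys the local-then-global stacking idea that NHS arranges across stages.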
Related papers
- RepNeXt: A Fast Multi-Scale CNN using Structural Reparameterization [8.346566205092433]
Lightweight Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are favored for their parameter efficiency and low latency.
This study investigates the complementary advantages of CNNs and ViTs to develop a versatile vision backbone tailored for resource-constrained applications.
arXiv Detail & Related papers (2024-06-23T04:11:12Z)
- OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation [70.17681136234202]
We reexamine the design distinctions and test the limits of what a sparse CNN can achieve.
We propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap.
This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module.
arXiv Detail & Related papers (2024-03-21T14:06:38Z)
- HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs [102.4965532024391]
Hybrid deep models of the Vision Transformer (ViT) and the Convolutional Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks.
We present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT), which upgrades the prevalent four-stage ViT to a five-stage ViT tailored for high-resolution inputs.
HIRI-ViT achieves the best published Top-1 accuracy to date of 84.3% on ImageNet with 448×448 inputs, an absolute improvement of 0.9% over iFormer-S (83.4%) with 224×224 inputs.
arXiv Detail & Related papers (2024-03-18T17:34:29Z)
- FMViT: A multiple-frequency mixing Vision Transformer [17.609263967586926]
We propose an efficient hybrid ViT architecture named FMViT.
This approach blends high-frequency features and low-frequency features with varying frequencies, enabling it to capture both local and global information effectively.
We demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of latency/accuracy trade-offs for various vision tasks.
arXiv Detail & Related papers (2023-11-09T19:33:50Z)
- Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision Transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z)
- Convolutional Embedding Makes Hierarchical Vision Transformer Stronger [16.72943631060293]
Vision Transformers (ViTs) have recently dominated a range of computer vision tasks, yet they suffer from low training-data efficiency and inferior local semantic representation capability without an appropriate inductive bias.
CNNs inherently capture region-aware semantics, inspiring researchers to introduce CNNs back into the architecture of ViTs to provide a desirable inductive bias.
In this paper, we explore how the macro architecture of the hybrid CNNs/ViTs enhances the performances of hierarchical ViTs.
arXiv Detail & Related papers (2022-07-27T06:36:36Z)
- Global Context Vision Transformers [78.5346173956383]
We propose the Global Context Vision Transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- SepViT: Separable Vision Transformer [20.403430632658946]
Vision Transformers often rely on extensive computational costs to achieve high performance, which is burdensome to deploy on resource-constrained devices.
We draw lessons from depthwise separable convolution and imitate its ideology to design an efficient Transformer backbone, i.e., Separable Vision Transformer, abbreviated as SepViT.
SepViT carries out local-global information interaction within and among windows in sequential order via depthwise separable self-attention (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-03-29T09:20:01Z)
- Container: Context Aggregation Network [83.12004501984043]
A recent finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks relying on larger input image resolutions, our efficient variant, named CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z)
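As a companion to the SepViT entry above, the following is a rough PyTorch sketch of the depthwise separable self-attention idea: self-attention is first computed inside each local window (the analogue of a depthwise convolution), and pooled per-window tokens then attend to one another across windows (the analogue of a pointwise convolution). The window partitioning, mean-pooled window tokens, and broadcast fusion below are illustrative assumptions, not SepViT's exact formulation.

```python
# Rough sketch of depthwise-separable-style self-attention (cf. the SepViT entry above).
# Within-window attention ~ "depthwise"; attention among window tokens ~ "pointwise".
import torch
import torch.nn as nn


class SeparableWindowAttentionSketch(nn.Module):
    def __init__(self, dim: int, window_size: int = 7, num_heads: int = 4):
        super().__init__()
        self.window_size = window_size
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        ws = self.window_size
        assert h % ws == 0 and w % ws == 0, "H and W must be divisible by the window size"

        # Partition the feature map into non-overlapping windows of ws*ws tokens each.
        windows = (
            x.reshape(b, c, h // ws, ws, w // ws, ws)
            .permute(0, 2, 4, 3, 5, 1)
            .reshape(b * (h // ws) * (w // ws), ws * ws, c)
        )

        # "Depthwise" step: self-attention inside each window.
        local, _ = self.local_attn(windows, windows, windows)
        windows = windows + local

        # "Pointwise" step: one pooled token per window attends across windows.
        win_tokens = windows.mean(dim=1).reshape(b, -1, c)   # (B, num_windows, C)
        mixed, _ = self.global_attn(win_tokens, win_tokens, win_tokens)
        win_tokens = win_tokens + mixed

        # Broadcast the mixed window tokens back to their windows.
        windows = windows + win_tokens.reshape(-1, 1, c)

        # Reverse the window partition to recover the feature map layout.
        out = (
            windows.reshape(b, h // ws, w // ws, ws, ws, c)
            .permute(0, 5, 1, 3, 2, 4)
            .reshape(b, c, h, w)
        )
        return out


if __name__ == "__main__":
    attn = SeparableWindowAttentionSketch(dim=64, window_size=7)
    print(attn(torch.randn(1, 64, 28, 28)).shape)  # torch.Size([1, 64, 28, 28])
```

Relative to full global attention over H×W tokens, this two-step scheme reduces the attention cost from roughly O((HW)^2) to O(HW·ws^2 + num_windows^2) per layer, which is the efficiency argument behind window-separable designs.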