Lightweight Vision Transformer with Cross Feature Attention
- URL: http://arxiv.org/abs/2207.07268v2
- Date: Wed, 5 Jul 2023 16:11:41 GMT
- Title: Lightweight Vision Transformer with Cross Feature Attention
- Authors: Youpeng Zhao, Huadong Tang, Yingying Jiang, Yong A and Qiang Wu
- Abstract summary: Convolutional neural networks (CNNs) exploit spatial inductive bias to learn visual representations.
ViTs can learn global representations with their self-attention mechanism, but they are usually heavy-weight and unsuitable for mobile devices.
We propose cross feature attention (XFA) to bring down the computation cost of transformers, and combine it with efficient mobile CNNs to form a novel light-weight CNN-ViT hybrid model, XFormer.
- Score: 6.103065659061625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in vision transformers (ViTs) have achieved great performance
in visual recognition tasks. Convolutional neural networks (CNNs) exploit
spatial inductive bias to learn visual representations, but these networks are
spatially local. ViTs can learn global representations with their
self-attention mechanism, but they are usually heavy-weight and unsuitable for
mobile devices. In this paper, we propose cross feature attention (XFA) to
bring down the computation cost of transformers, and combine it with efficient
mobile CNNs to form a novel, efficient, light-weight CNN-ViT hybrid model,
XFormer, which can serve as a general-purpose backbone for learning both global
and local representations. Experimental results show that XFormer outperforms
numerous CNN- and ViT-based models across different tasks and datasets. On the
ImageNet-1K dataset, XFormer achieves a top-1 accuracy of 78.5% with 5.5 million
parameters, which is 2.2% and 6.3% more accurate than EfficientNet-B0
(CNN-based) and DeiT (ViT-based), respectively, for a similar number of
parameters. Our model also performs well when transferred to object detection
and semantic segmentation tasks. On the MS COCO dataset, XFormer exceeds
MobileNetV2 by 10.5 AP (22.7 -> 33.2 AP) in the YOLOv3 framework with only 6.3M
parameters and 3.8G FLOPs. On the Cityscapes dataset, with only a simple
all-MLP decoder, XFormer achieves an mIoU of 78.5 at 15.3 FPS, surpassing
state-of-the-art lightweight segmentation networks.
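The abstract does not spell out how XFA lowers the quadratic cost of standard self-attention. A common strategy in lightweight ViTs is to compute attention across the feature (channel) dimension rather than across tokens, so the attention map is d x d instead of N x N and the cost grows linearly with the number of tokens. The PyTorch sketch below contrasts the two variants; it illustrates that general cost-reduction idea only and is not a reimplementation of the paper's XFA module.

```python
import torch
import torch.nn.functional as F

def token_attention(q, k, v):
    """Standard self-attention: the attention map is N x N (quadratic in token count N)."""
    d = q.shape[-1]
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # (N, N)
    return attn @ v                                               # (N, d)

def feature_attention(q, k, v):
    """Channel-wise ('cross-feature') attention: the map is d x d, linear in N.
    Illustrative only -- the paper's XFA formulation may differ."""
    n = q.shape[-2]
    attn = F.softmax(q.transpose(-2, -1) @ k / n ** 0.5, dim=-1)  # (d, d)
    return (attn @ v.transpose(-2, -1)).transpose(-2, -1)         # (N, d)

# Toy comparison: 196 tokens (a 14x14 patch grid) with 64-dim features.
q, k, v = (torch.randn(196, 64) for _ in range(3))
print(token_attention(q, k, v).shape)    # torch.Size([196, 64]) via a 196x196 map
print(feature_attention(q, k, v).shape)  # torch.Size([196, 64]) via a 64x64 map
```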
Related papers
- CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction [14.377544481394013]
CTA-Net combines CNNs and ViTs, with transformers capturing long-range dependencies and CNNs extracting localized features.
This integration enables efficient processing of detailed local and broader contextual information.
Experiments on small-scale datasets with fewer than 100,000 samples show that CTA-Net achieves superior performance.
arXiv Detail & Related papers (2024-10-15T09:27:26Z)
- RepNeXt: A Fast Multi-Scale CNN using Structural Reparameterization [8.346566205092433]
Lightweight Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are favored for their parameter efficiency and low latency.
This study investigates the complementary advantages of CNNs and ViTs to develop a versatile vision backbone tailored for resource-constrained applications.
arXiv Detail & Related papers (2024-06-23T04:11:12Z)
- OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation [70.17681136234202]
We reexamine the design distinctions and test the limits of what a sparse CNN can achieve.
We propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap.
This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module.
arXiv Detail & Related papers (2024-03-21T14:06:38Z)
- FMViT: A multiple-frequency mixing Vision Transformer [17.609263967586926]
We propose an efficient hybrid ViT architecture named FMViT.
This approach blends high-frequency features and low-frequency features with varying frequencies, enabling it to capture both local and global information effectively.
We demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of latency/accuracy trade-offs for various vision tasks.
arXiv Detail & Related papers (2023-11-09T19:33:50Z)
- Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases.
Our DHVT achieves state-of-the-art performance with a lightweight model: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z)
- EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [68.35683849098105]
We introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-06-21T17:59:56Z)
- Global Context Vision Transformers [78.5346173956383]
We propose the global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer [24.47196590256829]
We introduce MobileViT, a light-weight vision transformer for mobile devices.
Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets.
arXiv Detail & Related papers (2021-10-05T17:07:53Z)
- Container: Context Aggregation Network [83.12004501984043]
A recent finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z)
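Several of the papers listed above (XFormer itself, MobileViT, EdgeViTs, EdgeNeXt, CTA-Net) follow the same hybrid recipe: convolutional stages extract local features cheaply at high resolution, and transformer blocks add global context at lower resolution. The PyTorch sketch below shows that generic stage layout; the channel widths, depths, and use of nn.TransformerEncoderLayer are illustrative assumptions, not the configuration of any specific paper above.

```python
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """Local feature extraction: a depth-wise separable conv block (MobileNet-style)."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, stride, 1, groups=c_in, bias=False),  # depth-wise conv
            nn.BatchNorm2d(c_in), nn.SiLU(),
            nn.Conv2d(c_in, c_out, 1, bias=False),                          # point-wise conv
            nn.BatchNorm2d(c_out), nn.SiLU(),
        )
    def forward(self, x):
        return self.block(x)

class AttnStage(nn.Module):
    """Global context: flatten the feature map to tokens and run a transformer encoder layer."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True)
    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)           # (B, H*W, C)
        tokens = self.encoder(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class HybridBackbone(nn.Module):
    """Conv stages at high resolution, attention at low resolution (assumed sizes)."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.stem = nn.Conv2d(3, 32, 3, stride=2, padding=1)
        self.local = nn.Sequential(ConvStage(32, 64), ConvStage(64, 128), ConvStage(128, 192))
        self.global_ctx = AttnStage(192)
        self.head = nn.Linear(192, num_classes)
    def forward(self, x):
        x = self.global_ctx(self.local(self.stem(x)))   # 224 -> 14x14 feature map
        return self.head(x.mean(dim=(2, 3)))            # global average pool -> classifier

model = HybridBackbone()
print(model(torch.randn(1, 3, 224, 224)).shape)         # torch.Size([1, 1000])
```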
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy or quality of the information presented and is not responsible for any consequences of its use.