MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision
Models
- URL: http://arxiv.org/abs/2210.01820v1
- Date: Tue, 4 Oct 2022 18:00:06 GMT
- Title: MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision
Models
- Authors: Chenglin Yang, Siyuan Qiao, Qihang Yu, Xiaoding Yuan, Yukun Zhu, Alan
Yuille, Hartwig Adam, Liang-Chieh Chen
- Abstract summary: This paper presents MOAT, a family of neural networks that build on top of MObile convolution (i.e., inverted residual blocks) and ATtention.
Starting with a standard Transformer block, we replace its multi-layer perceptron with a mobile convolution block, and further reorder it before the self-attention operation.
Our conceptually simple MOAT networks are surprisingly effective, achieving 89.1% top-1 accuracy on ImageNet-1K with ImageNet-22K pretraining.
- Score: 40.40784209977589
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents MOAT, a family of neural networks that build on top of
MObile convolution (i.e., inverted residual blocks) and ATtention. Unlike the
current works that stack separate mobile convolution and transformer blocks, we
effectively merge them into a MOAT block. Starting with a standard Transformer
block, we replace its multi-layer perceptron with a mobile convolution block,
and further reorder it before the self-attention operation. The mobile
convolution block not only enhances the network representation capacity, but
also produces better downsampled features. Our conceptually simple MOAT
networks are surprisingly effective, achieving 89.1% top-1 accuracy on
ImageNet-1K with ImageNet-22K pretraining. Additionally, MOAT can be seamlessly
applied to downstream tasks that require large resolution inputs by simply
converting the global attention to window attention. Thanks to the mobile
convolution that effectively exchanges local information between pixels (and
thus cross-windows), MOAT does not need the extra window-shifting mechanism. As
a result, on COCO object detection, MOAT achieves 59.2% box AP with 227M model
parameters (single-scale inference, and hard NMS), and on ADE20K semantic
segmentation, MOAT attains 57.6% mIoU with 496M model parameters (single-scale
inference). Finally, the tiny-MOAT family, obtained by simply reducing the
channel sizes, also surprisingly outperforms several mobile-specific
transformer-based models on ImageNet. We hope our simple yet effective MOAT
will inspire more seamless integration of convolution and self-attention. Code
is made publicly available.
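To make the block design concrete, below is a minimal PyTorch-style sketch of a MOAT-like block as described in the abstract: a mobile convolution (inverted residual) sub-block placed before multi-head self-attention, in place of the usual Transformer MLP. The expansion ratio, normalization choices, and plain global attention are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of a MOAT-style block (assumptions noted): an inverted-residual
# mobile convolution sub-block runs first, then multi-head self-attention
# with a residual connection, matching the reordering described above.
import torch
import torch.nn as nn


class MobileConvBlock(nn.Module):
    """Inverted residual: 1x1 expand -> 3x3 depthwise -> 1x1 project."""

    def __init__(self, dim: int, expansion: int = 4, stride: int = 1):
        super().__init__()
        hidden = dim * expansion
        self.use_residual = stride == 1
        self.block = nn.Sequential(
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, hidden, kernel_size=1, bias=False),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=stride,
                      padding=1, groups=hidden, bias=False),  # depthwise
            nn.GELU(),
            nn.Conv2d(hidden, dim, kernel_size=1, bias=False),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out


class MOATStyleBlock(nn.Module):
    """Mobile convolution first, then global multi-head self-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.mbconv = MobileConvBlock(dim)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                              # x: (B, C, H, W)
        x = self.mbconv(x)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        q = self.norm(tokens)
        attn_out, _ = self.attn(q, q, q)
        tokens = tokens + attn_out                     # residual attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)


# Example: a 64-channel feature map keeps its shape through the block.
# y = MOATStyleBlock(dim=64)(torch.randn(2, 64, 14, 14))  # -> (2, 64, 14, 14)
```

Per the abstract, the global attention here can be swapped for non-overlapping window attention when inputs are large; because the 3x3 depthwise convolution already exchanges information across window borders, no extra window-shifting mechanism is needed.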
Related papers
- Mobile V-MoEs: Scaling Down Vision Transformers via Sparse
Mixture-of-Experts [55.282613372420805]
We explore the use of sparse MoEs to scale down Vision Transformers (ViTs) to make them more attractive for resource-constrained vision applications.
We propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts.
We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.
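As a rough illustration of the per-image routing described above, the sketch below lets a small router pick one expert per image (from a pooled image representation) and sends every token of that image through the chosen expert; the pooling, top-1 routing, and MLP experts are assumptions for illustration, not the paper's exact design.

```python
# Illustrative per-image mixture-of-experts routing: the router scores each
# image once (from mean-pooled tokens) and the whole image is processed by
# its top-1 expert, instead of routing individual patches. Expert and
# router shapes are assumptions.
import torch
import torch.nn as nn


class PerImageMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, tokens):                 # tokens: (B, N, C)
        image_repr = tokens.mean(dim=1)        # one routing decision per image
        gates = self.router(image_repr).softmax(dim=-1)   # (B, num_experts)
        expert_idx = gates.argmax(dim=-1)                  # top-1 expert id
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            picked = expert_idx == e
            if picked.any():
                # every token of a selected image uses the same expert
                out[picked] = expert(tokens[picked]) * gates[picked, e].view(-1, 1, 1)
            # unpicked experts do no work for this batch (sparse compute)
        return out
```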
arXiv Detail & Related papers (2023-09-08T14:24:10Z)
- Rethinking Mobile Block for Efficient Attention-based Models [60.0312591342016]
This paper focuses on developing modern, efficient, lightweight models for dense predictions while trading off parameters, FLOPs, and performance.
Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterpart has been recognized by attention-based studies.
We extend the CNN-based IRB to attention-based models and abstract a one-residual Meta Mobile Block (MMB) for lightweight model design.
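One hedged reading of the one-residual meta-block idea is sketched below: a single residual connection wraps an expansion layer, a swappable token mixer (a depthwise convolution for the CNN-style case, self-attention for the attention-based case), and a projection. The ordering, activation, and defaults are assumptions based on this summary rather than the paper's definitive design.

```python
# Sketch of a one-residual meta mobile block: expand -> token mixer
# (depthwise conv or self-attention) -> project, all under a single
# residual connection. Details are illustrative assumptions.
import torch
import torch.nn as nn


class MetaMobileBlock(nn.Module):
    def __init__(self, dim: int, expansion: int = 4,
                 mixer: str = "conv", num_heads: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.mixer_type = mixer
        self.expand = nn.Conv2d(dim, hidden, kernel_size=1)
        if mixer == "conv":
            # IRB-like local mixing via depthwise convolution
            self.mixer = nn.Conv2d(hidden, hidden, kernel_size=3,
                                   padding=1, groups=hidden)
        else:
            # attention-based global mixing over flattened spatial tokens
            self.mixer = nn.MultiheadAttention(hidden, num_heads,
                                               batch_first=True)
        self.act = nn.GELU()
        self.project = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x):                     # x: (B, C, H, W)
        h = self.act(self.expand(x))
        if self.mixer_type == "conv":
            h = self.mixer(h)
        else:
            b, c, hh, ww = h.shape
            t = h.flatten(2).transpose(1, 2)  # (B, H*W, C')
            t, _ = self.mixer(t, t, t)
            h = t.transpose(1, 2).reshape(b, c, hh, ww)
        return x + self.project(self.act(h))  # the single residual
```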
arXiv Detail & Related papers (2023-01-03T15:11:41Z)
- MoCoViT: Mobile Convolutional Vision Transformer [13.233314183471213]
We present the Mobile Convolutional Vision Transformer (MoCoViT), which improves performance and efficiency by introducing transformer structures into mobile convolutional networks.
MoCoViT is carefully designed for mobile devices and is very lightweight, accomplished through two primary modifications.
Comprehensive experiments verify that our proposed MoCoViT family outperforms state-of-the-art portable CNNs and transformer networks on various vision tasks.
arXiv Detail & Related papers (2022-05-25T10:21:57Z)
- UniFormer: Unifying Convolution and Self-attention for Visual Recognition [69.68907941116127]
Convolutional neural networks (CNNs) and vision transformers (ViTs) have been two dominant frameworks in the past few years.
We propose a novel Unified transFormer (UniFormer) which seamlessly integrates the merits of convolution and self-attention in a concise transformer format.
Our UniFormer achieves 86.3% top-1 accuracy on ImageNet-1K classification.
arXiv Detail & Related papers (2022-01-24T04:39:39Z)
- A Simple Approach to Image Tilt Correction with Self-Attention MobileNet for Smartphones [4.989480853499916]
We present a Self-Attention MobileNet that can model long-range dependencies between image features instead of processing only local regions.
We also propose a novel training pipeline for the task of image tilt detection.
We present state-of-the-art results on detecting the image tilt angle on mobile devices as compared to the MobileNetV3 model.
arXiv Detail & Related papers (2021-10-31T03:41:46Z)
- CMT: Convolutional Neural Networks Meet Vision Transformers [68.10025999594883]
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image.
There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs).
We propose a new transformer-based hybrid network that takes advantage of transformers to capture long-range dependencies and of CNNs to model local features.
In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller in FLOPs than the existing DeiT and EfficientNet, respectively.
arXiv Detail & Related papers (2021-07-13T17:47:19Z)
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
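A simplified sketch of the cross-shaped idea follows: half of the channels attend within horizontal stripes and the other half within vertical stripes, so each token's effective field forms a cross at roughly the cost of local window attention. The stripe width, the channel-wise (rather than head-wise) split, and the omission of positional encoding are simplifications, not the paper's exact formulation.

```python
# Simplified cross-shaped window attention: one attention module works on
# horizontal stripes (full width), the other on vertical stripes (full
# height); their outputs are concatenated along channels. Stripe width and
# the channel split are illustrative assumptions.
import torch
import torch.nn as nn


class CrossShapedWindowAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, stripe: int = 2):
        super().__init__()
        assert dim % 2 == 0 and num_heads % 2 == 0
        self.stripe = stripe
        self.attn_h = nn.MultiheadAttention(dim // 2, num_heads // 2,
                                            batch_first=True)
        self.attn_v = nn.MultiheadAttention(dim // 2, num_heads // 2,
                                            batch_first=True)

    def _stripe_attn(self, x, attn, horizontal: bool):
        b, c, h, w = x.shape
        s = self.stripe
        if horizontal:                        # stripes of height s, full width
            x = x.reshape(b, c, h // s, s, w).permute(0, 2, 3, 4, 1)
            x = x.reshape(b * (h // s), s * w, c)
        else:                                 # stripes of width s, full height
            x = x.reshape(b, c, h, w // s, s).permute(0, 3, 2, 4, 1)
            x = x.reshape(b * (w // s), h * s, c)
        x, _ = attn(x, x, x)
        if horizontal:
            x = x.reshape(b, h // s, s, w, c).permute(0, 4, 1, 2, 3)
        else:
            x = x.reshape(b, w // s, h, s, c).permute(0, 4, 2, 1, 3)
        return x.reshape(b, c, h, w)

    def forward(self, x):          # x: (B, C, H, W); H, W divisible by stripe
        x_h, x_v = x.chunk(2, dim=1)          # split channels in half
        out_h = self._stripe_attn(x_h, self.attn_h, horizontal=True)
        out_v = self._stripe_attn(x_v, self.attn_v, horizontal=False)
        return torch.cat([out_h, out_v], dim=1)
```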
arXiv Detail & Related papers (2021-07-01T17:59:56Z)
- ULSAM: Ultra-Lightweight Subspace Attention Module for Compact Convolutional Neural Networks [4.143032261649983]
"Ultra-Lightweight Subspace Attention Mechanism" (ULSAM) is end-to-end trainable and can be deployed as a plug-and-play module in compact convolutional neural networks (CNNs)
We achieve $approx$13% and $approx$25% reduction in both the FLOPs and parameter counts of MobileNet-V2 with a 0.27% and more than 1% improvement in top-1 accuracy on the ImageNet-1K and fine-grained image classification datasets (respectively)
arXiv Detail & Related papers (2020-06-26T17:05:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.