MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision
Models
- URL: http://arxiv.org/abs/2210.01820v1
- Date: Tue, 4 Oct 2022 18:00:06 GMT
- Title: MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision
Models
- Authors: Chenglin Yang, Siyuan Qiao, Qihang Yu, Xiaoding Yuan, Yukun Zhu, Alan
Yuille, Hartwig Adam, Liang-Chieh Chen
- Abstract summary: This paper presents MOAT, a family of neural networks that build on top of MObile convolution (i.e., inverted residual blocks) and ATtention.
Starting with a standard Transformer block, we replace its multi-layer perceptron with a mobile convolution block and reorder it before the self-attention operation.
Our conceptually simple MOAT networks are surprisingly effective, achieving 89.1% top-1 accuracy on ImageNet-1K with ImageNet-22K pretraining.
- Score: 40.40784209977589
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents MOAT, a family of neural networks that build on top of
MObile convolution (i.e., inverted residual blocks) and ATtention. Unlike the
current works that stack separate mobile convolution and transformer blocks, we
effectively merge them into a MOAT block. Starting with a standard Transformer
block, we replace its multi-layer perceptron with a mobile convolution block,
and further reorder it before the self-attention operation. The mobile
convolution block not only enhances the network representation capacity, but
also produces better downsampled features. Our conceptually simple MOAT
networks are surprisingly effective, achieving 89.1% top-1 accuracy on
ImageNet-1K with ImageNet-22K pretraining. Additionally, MOAT can be seamlessly
applied to downstream tasks that require large resolution inputs by simply
converting the global attention to window attention. Thanks to the mobile
convolution that effectively exchanges local information between pixels (and
thus cross-windows), MOAT does not need the extra window-shifting mechanism. As
a result, on COCO object detection, MOAT achieves 59.2% box AP with 227M model
parameters (single-scale inference, and hard NMS), and on ADE20K semantic
segmentation, MOAT attains 57.6% mIoU with 496M model parameters (single-scale
inference). Finally, the tiny-MOAT family, obtained by simply reducing the
channel sizes, also surprisingly outperforms several mobile-specific
transformer-based models on ImageNet. We hope our simple yet effective MOAT
will inspire more seamless integration of convolution and self-attention. Code
is made publicly available.
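To make the block structure in the abstract concrete, below is a minimal PyTorch sketch of one MOAT block: a mobile convolution (inverted residual) block that replaces the Transformer's MLP and is moved before the self-attention, with an optional non-shifted window attention for large inputs. This is a reading of the abstract only, not the authors' released code; the expansion ratio, normalization placement, and omission of squeeze-and-excitation and downsampling are assumptions.

```python
from typing import Optional

import torch
import torch.nn as nn


class MBConv(nn.Module):
    """Inverted residual block: 1x1 expand -> 3x3 depthwise -> 1x1 project."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.block = nn.Sequential(
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, hidden, kernel_size=1, bias=False),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),  # depthwise: mixes neighbouring pixels
            nn.GELU(),
            nn.Conv2d(hidden, dim, kernel_size=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)  # residual connection


class MOATBlock(nn.Module):
    """Mobile convolution first, then (global or window) self-attention."""

    def __init__(self, dim: int, num_heads: int = 8,
                 window_size: Optional[int] = None):
        super().__init__()
        self.mbconv = MBConv(dim)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window_size = window_size

    def _attend(self, tokens: torch.Tensor) -> torch.Tensor:
        # Pre-norm residual self-attention over a token sequence (B', L, C).
        q = self.norm(tokens)
        return tokens + self.attn(q, q, q, need_weights=False)[0]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W). The MBConv comes before attention, as in the abstract.
        x = self.mbconv(x)
        B, C, H, W = x.shape
        if self.window_size is None:  # global attention (classification setting)
            tokens = self._attend(x.flatten(2).transpose(1, 2))  # (B, H*W, C)
            return tokens.transpose(1, 2).reshape(B, C, H, W)
        # Non-overlapping window attention for large-resolution inputs; no window
        # shifting, since the depthwise conv already exchanges cross-window info.
        ws = self.window_size  # assumes H and W are divisible by ws
        wins = (x.reshape(B, C, H // ws, ws, W // ws, ws)
                 .permute(0, 2, 4, 3, 5, 1)
                 .reshape(-1, ws * ws, C))        # (B * num_windows, ws*ws, C)
        wins = self._attend(wins)
        return (wins.reshape(B, H // ws, W // ws, ws, ws, C)
                    .permute(0, 5, 1, 3, 2, 4)
                    .reshape(B, C, H, W))
```

As an illustration, `MOATBlock(dim=256)` would correspond to the global-attention setting used for classification, while `MOATBlock(dim=256, window_size=14)` sketches the window-attention variant for dense prediction; per the abstract, the depthwise convolution placed before attention is what removes the need for a window-shifting mechanism.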
Related papers
- EMOv2: Pushing 5M Vision Model Frontier [92.21687467702972]
We set up a new frontier for 5M-magnitude lightweight models on various downstream tasks.
Our work rethinks the lightweight infrastructure of the efficient IRB and the practical components of the Transformer.
Considering the imperceptible latency for mobile users when downloading models over 4G/5G bandwidth, we investigate the performance upper limit of lightweight models at the 5M-parameter magnitude.
arXiv Detail & Related papers (2024-12-09T17:12:22Z) - CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [73.80247057590519]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with the powerful global context modeling capability of their token mixers.
We introduce CAS-ViT (Convolutional Additive Self-attention Vision Transformers) to achieve a balance between efficiency and performance in mobile applications.
Our model achieves 83.0%/84.1% top-1 accuracy with only 12M/21M parameters on ImageNet-1K.
arXiv Detail & Related papers (2024-08-07T11:33:46Z) - Mobile V-MoEs: Scaling Down Vision Transformers via Sparse
Mixture-of-Experts [55.282613372420805]
We explore the use of sparse MoEs to scale down Vision Transformers (ViTs) and make them more attractive for resource-constrained vision applications.
We propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts.
We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.
arXiv Detail & Related papers (2023-09-08T14:24:10Z) - Rethinking Mobile Block for Efficient Attention-based Models [60.0312591342016]
This paper focuses on developing modern, efficient, lightweight models for dense predictions while trading off parameters, FLOPs, and performance.
The Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterpart has been recognized by attention-based studies.
We extend the CNN-based IRB to attention-based models and abstract a one-residual Meta Mobile Block (MMB) for lightweight model design.
arXiv Detail & Related papers (2023-01-03T15:11:41Z) - MoCoViT: Mobile Convolutional Vision Transformer [13.233314183471213]
We present the Mobile Convolutional Vision Transformer (MoCoViT), which improves performance and efficiency by introducing transformer structures into mobile convolutional networks.
MoCoViT is carefully designed for mobile devices and is very lightweight, accomplished through two primary modifications.
Comprehensive experiments verify that our proposed MoCoViT family outperforms state-of-the-art portable CNNs and transformers on various vision tasks.
arXiv Detail & Related papers (2022-05-25T10:21:57Z) - UniFormer: Unifying Convolution and Self-attention for Visual
Recognition [69.68907941116127]
Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years.
We propose a novel Unified transFormer (UniFormer) which seamlessly integrates the merits of convolution and self-attention in a concise transformer format.
Our UniFormer achieves 86.3% top-1 accuracy on ImageNet-1K classification.
arXiv Detail & Related papers (2022-01-24T04:39:39Z) - ULSAM: Ultra-Lightweight Subspace Attention Module for Compact
Convolutional Neural Networks [4.143032261649983]
"Ultra-Lightweight Subspace Attention Mechanism" (ULSAM) is end-to-end trainable and can be deployed as a plug-and-play module in compact convolutional neural networks (CNNs)
We achieve $approx$13% and $approx$25% reduction in both the FLOPs and parameter counts of MobileNet-V2 with a 0.27% and more than 1% improvement in top-1 accuracy on the ImageNet-1K and fine-grained image classification datasets (respectively)
arXiv Detail & Related papers (2020-06-26T17:05:43Z)