Focal Modulation Networks
- URL: http://arxiv.org/abs/2203.11926v1
- Date: Tue, 22 Mar 2022 17:54:50 GMT
- Title: Focal Modulation Networks
- Authors: Jianwei Yang, Chunyuan Li, Jianfeng Gao
- Abstract summary: FocalNet completely replaces self-attention (SA) with a focal modulation module.
FocalNets with tiny and base sizes achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K.
FocalNets exhibit remarkable superiority when transferred to downstream tasks.
- Score: 105.93086472906765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose focal modulation network (FocalNet in short), where
self-attention (SA) is completely replaced by a focal modulation module that is
more effective and efficient for modeling token interactions. Focal modulation
comprises three components: $(i)$ hierarchical contextualization, implemented
using a stack of depth-wise convolutional layers, to encode visual contexts
from short to long ranges at different granularity levels, $(ii)$ gated
aggregation to selectively aggregate context features for each visual token
(query) based on its content, and $(iii)$ modulation or element-wise affine
transformation to fuse the aggregated features into the query vector. Extensive
experiments show that FocalNets outperform the state-of-the-art SA counterparts
(e.g., Swin Transformers) with similar time and memory cost on the tasks of
image classification, object detection, and semantic segmentation.
Specifically, our FocalNets with tiny and base sizes achieve 82.3% and 83.9%
top-1 accuracy on ImageNet-1K. After pretraining on ImageNet-22K, the base model
attains 86.5% and 87.3% top-1 accuracy when finetuned at resolutions of
224$\times$224 and 384$\times$384, respectively. FocalNets exhibit remarkable superiority when
transferred to downstream tasks. For object detection with Mask R-CNN, our
FocalNet base trained with 1$\times$ already surpasses Swin trained with
3$\times$ schedule (49.0 vs. 48.5). For semantic segmentation with UperNet,
FocalNet base evaluated at single-scale outperforms Swin evaluated at
multi-scale (50.5 vs. 49.7). These results render focal modulation a favorable
alternative to SA for effective and efficient visual modeling in real-world
applications. Code is available at https://github.com/microsoft/FocalNet.
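The three components described in the abstract map onto a short block of code. Below is a minimal PyTorch sketch of the focal modulation idea, written only from the description above; the number of focal levels, the kernel sizes, and all names (FocalModulationSketch, proj_in, modulator, etc.) are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Minimal sketch of focal modulation as described in the abstract.
# Kernel sizes, number of focal levels, and module names are assumptions;
# the official code lives at https://github.com/microsoft/FocalNet.
import torch
import torch.nn as nn


class FocalModulationSketch(nn.Module):
    def __init__(self, dim: int, focal_levels: int = 3, kernel_size: int = 3):
        super().__init__()
        self.focal_levels = focal_levels
        # Projects the input into a query, an initial context, and per-level gates.
        self.proj_in = nn.Linear(dim, 2 * dim + (focal_levels + 1))
        # (i) Hierarchical contextualization: a stack of depth-wise convolutions
        # whose effective receptive field grows with each level.
        self.context_layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
                nn.GELU(),
            )
            for _ in range(focal_levels)
        ])
        # (iii) Modulation: a light transform of the aggregated context before it
        # is fused into the query by element-wise multiplication.
        self.modulator = nn.Conv2d(dim, dim, kernel_size=1)
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) channel-last tokens, as in ViT-style backbones.
        B, H, W, C = x.shape
        q, ctx, gates = torch.split(
            self.proj_in(x), [C, C, self.focal_levels + 1], dim=-1
        )
        ctx = ctx.permute(0, 3, 1, 2)      # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)  # (B, L+1, H, W)

        # (ii) Gated aggregation: each token mixes the context levels with its
        # own content-dependent gates.
        aggregated = 0
        for level, layer in enumerate(self.context_layers):
            ctx = layer(ctx)
            aggregated = aggregated + ctx * gates[:, level:level + 1]
        # Global (image-level) context as the final focal level.
        global_ctx = ctx.mean(dim=(2, 3), keepdim=True)
        aggregated = aggregated + global_ctx * gates[:, self.focal_levels:]

        # Fuse the modulated context into the query vector.
        out = q * self.modulator(aggregated).permute(0, 2, 3, 1)
        return self.proj_out(out)


if __name__ == "__main__":
    tokens = torch.randn(2, 14, 14, 96)                  # (batch, H, W, channels)
    print(FocalModulationSketch(dim=96)(tokens).shape)   # torch.Size([2, 14, 14, 96])
```

Unlike self-attention, every operation here is convolutional or element-wise, which is what gives the module its query-agnostic context cost and linear complexity in the number of tokens.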
Related papers
- ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding [9.144813021145039]
This paper introduces ParFormer, a vision transformer that incorporates a Parallel Mixer and a Sparse Channel Attention Patch Embedding (SCAPE).
ParFormer improves feature extraction by combining convolutional and attention mechanisms.
For edge device deployment, ParFormer-T excels with a throughput of 278.1 images/sec, which is 1.38$\times$ higher than EdgeNeXt-S.
The larger variant, ParFormer-L, reaches 83.5% Top-1 accuracy, offering a balanced trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2024-03-22T07:32:21Z) - DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition [62.95223898214866]
We explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field.
With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages.
Our experiment results show that our DilateFormer achieves state-of-the-art performance on various vision tasks.
arXiv Detail & Related papers (2023-02-03T14:59:31Z) - MogaNet: Multi-order Gated Aggregation Network [64.16774341908365]
We propose a new family of modern ConvNets, dubbed MogaNet, for discriminative visual representation learning.
MogaNet encapsulates conceptually simple yet effective convolutions and gated aggregation into a compact module.
MogaNet exhibits great scalability, impressive efficiency of parameters, and competitive performance compared to state-of-the-art ViTs and ConvNets on ImageNet.
arXiv Detail & Related papers (2022-11-07T04:31:17Z) - Global Context Vision Transformers [78.5346173956383]
We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z) - EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm [111.17100512647619]
This paper explains the rationality of Vision Transformer by analogy with the proven and practical evolutionary algorithm (EA).
We propose a novel pyramid EATFormer backbone that only contains the proposed EA-based transformer (EAT) block.
Extensive quantitative experiments on image classification and downstream tasks, together with explanatory experiments, demonstrate the effectiveness and superiority of our approach.
arXiv Detail & Related papers (2022-06-19T04:49:35Z) - CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
arXiv Detail & Related papers (2021-07-01T17:59:56Z) - SA-Net: Shuffle Attention for Deep Convolutional Neural Networks [0.0]
We propose an efficient Shuffle Attention (SA) module to address this issue.
The proposed SA module is efficient yet effective: it has only about 300 parameters and 2.76e-3 GFLOPs, versus 25.56M parameters and 4.12 GFLOPs for the ResNet50 backbone.
arXiv Detail & Related papers (2021-01-30T15:23:17Z) - MUXConv: Information Multiplexing in Convolutional Neural Networks [25.284420772533572]
MUXConv is designed to increase the flow of information by progressively multiplexing channel and spatial information in the network.
On ImageNet, the resulting models, dubbed MUXNets, match the performance (75.3% top-1 accuracy) and multiply-add operations (218M) of MobileNetV3.
MUXNet also performs well under transfer learning and when adapted to object detection.
arXiv Detail & Related papers (2020-03-31T00:09:47Z)