Global Filter Networks for Image Classification
- URL: http://arxiv.org/abs/2107.00645v1
- Date: Thu, 1 Jul 2021 17:58:16 GMT
- Title: Global Filter Networks for Image Classification
- Authors: Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, Jie Zhou
- Abstract summary: We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
- Score: 90.81352483076323
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in self-attention and pure multi-layer perceptrons (MLP)
models for vision have shown great potential in achieving promising performance
with fewer inductive biases. These models are generally based on learning
interaction among spatial locations from raw data. The complexity of
self-attention and MLP grows quadratically as the image size increases, which
makes these models hard to scale up when high-resolution features are required.
In this paper, we present the Global Filter Network (GFNet), a conceptually
simple yet computationally efficient architecture, that learns long-term
spatial dependencies in the frequency domain with log-linear complexity. Our
architecture replaces the self-attention layer in vision transformers with
three key operations: a 2D discrete Fourier transform, an element-wise
multiplication between frequency-domain features and learnable global filters,
and a 2D inverse Fourier transform. We exhibit favorable accuracy/complexity
trade-offs of our models on both ImageNet and downstream tasks. Our results
demonstrate that GFNet can be a very competitive alternative to
transformer-style models and CNNs in efficiency, generalization ability and
robustness. Code is available at https://github.com/raoyongming/GFNet
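The three operations named in the abstract (2D discrete Fourier transform, element-wise multiplication with a global filter, 2D inverse Fourier transform) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `global_filter`, the `(H, W, C)` feature layout, the use of the real-input FFT, and the orthonormal scaling are all assumptions made for this example, and the filter here is a fixed array rather than a trained parameter.

```python
import numpy as np

def global_filter(x, filt):
    """Apply a frequency-domain filter to real-valued spatial features.

    x:    (H, W, C) real array of spatial features (layout is an assumption).
    filt: (H, W // 2 + 1, C) complex array, one filter value per frequency
          and channel. In a trained model this would be a learnable parameter.
    """
    H, W, _ = x.shape
    # Step 1: 2D FFT over the two spatial axes (real-input variant).
    X = np.fft.rfft2(x, axes=(0, 1), norm="ortho")
    # Step 2: element-wise multiplication with the global filter.
    X = X * filt
    # Step 3: 2D inverse FFT back to the spatial domain.
    return np.fft.irfft2(X, s=(H, W), axes=(0, 1), norm="ortho")

rng = np.random.default_rng(0)
H, W, C = 14, 14, 8
x = rng.standard_normal((H, W, C))

# An all-ones filter leaves the features unchanged.
identity = np.ones((H, W // 2 + 1, C), dtype=complex)
y = global_filter(x, identity)

# A filter that keeps only the zero-frequency term replaces every
# location with the per-channel spatial mean, so one frequency-domain
# weight influences all spatial positions at once.
dc_only = np.zeros_like(identity)
dc_only[0, 0, :] = 1.0
smoothed = global_filter(x, dc_only)
```

Because each FFT costs on the order of N log N operations for N spatial locations and the multiplication is linear in N, this mixing step is log-linear overall, in contrast to the quadratic cost the abstract attributes to self-attention and MLP mixing.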
Related papers
- DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs).
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization.
arXiv Detail & Related papers (2024-07-18T22:15:35Z)
- DeblurDiNAT: A Generalizable Transformer for Perceptual Image Deblurring [1.5124439914522694]
DeblurDiNAT is a generalizable and efficient encoder-decoder Transformer which restores clean images visually close to the ground truth.
We present a linear feed-forward network and a non-linear dual-stage feature fusion module for faster feature propagation across the network.
arXiv Detail & Related papers (2024-03-19T21:31:31Z)
- As large as it gets: Learning infinitely large Filters via Neural Implicit Functions in the Fourier Domain [22.512062422338914]
Recent work in neural networks for image classification has seen a strong tendency towards increasing the spatial context.
We propose a module for studying the effective filter size of convolutional neural networks.
Our analysis shows that, although the proposed networks could learn very large convolution kernels, the learned filters are well localized and relatively small in practice.
arXiv Detail & Related papers (2023-07-19T14:21:11Z)
- Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation [53.04781510348416]
Video-based 3D human pose and shape estimations are evaluated by intra-frame accuracy and inter-frame smoothness.
We propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, Global-to-Local Transformer (GLoT).
Our GLoT surpasses previous state-of-the-art methods with the lowest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
arXiv Detail & Related papers (2023-03-26T14:57:49Z)
- Efficient Context Integration through Factorized Pyramidal Learning for Ultra-Lightweight Semantic Segmentation [1.0499611180329804]
We propose a novel Factorized Pyramidal Learning (FPL) module to aggregate rich contextual information in an efficient manner.
We decompose the spatial pyramid into two stages which enables a simple and efficient feature fusion within the module to solve the notorious checkerboard effect.
Based on the FPL module and FIR unit, we propose an ultra-lightweight real-time network, called FPLNet, which achieves state-of-the-art accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-02-23T05:34:51Z)
- Optimizing Vision Transformers for Medical Image Segmentation and Few-Shot Domain Adaptation [11.690799827071606]
We propose Convolutional Swin-Unet (CS-Unet) transformer blocks and optimise their settings with relation to patch embedding, projection, the feed-forward network, up sampling and skip connections.
CS-Unet can be trained from scratch and inherits the superiority of convolutions in each feature process phase.
Experiments show that CS-Unet without pre-training surpasses other state-of-the-art counterparts by large margins on two medical CT and MRI datasets with fewer parameters.
arXiv Detail & Related papers (2022-10-14T19:18:52Z)
- Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z)
- MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU achieves the best performance over previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z)
- FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization [73.41395947275473]
We propose a novel frequency-aware architecture, in which the domain-specific features are filtered out in the transformed frequency domain.
Experiments on three benchmarks demonstrate significant performance, outperforming the state-of-the-art methods by a margin of 3%, 4% and 9%, respectively.
arXiv Detail & Related papers (2022-03-24T07:26:29Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.