Inception Transformer
- URL: http://arxiv.org/abs/2205.12956v2
- Date: Thu, 26 May 2022 17:18:32 GMT
- Title: Inception Transformer
- Authors: Chenyang Si, Weihao Yu, Pan Zhou, Yichen Zhou, Xinchao Wang, Shuicheng Yan
- Abstract summary: Inception Transformer, or iFormer, learns comprehensive features with both high- and low-frequency information in visual data.
We benchmark the iFormer on a series of vision tasks and show that it achieves impressive performance on image classification, COCO detection, and ADE20K segmentation.
- Score: 151.939077819196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies show that the Transformer has a strong capability for
building long-range dependencies, yet is weak at capturing the high frequencies
that predominantly convey local information. To tackle this issue, we present a
novel and general-purpose Inception Transformer, or iFormer for short, that
effectively learns comprehensive features with both high- and low-frequency
information in visual data. Specifically, we design an Inception mixer that
explicitly grafts the advantages of convolution and max-pooling for capturing
high-frequency information onto Transformers. Unlike recent hybrid frameworks,
the Inception mixer gains efficiency through a channel-splitting mechanism that
runs parallel convolution/max-pooling paths and a self-attention path as high-
and low-frequency mixers, while retaining the flexibility to model
discriminative information scattered across a wide frequency range. Considering
that bottom layers play a larger role in capturing high-frequency details while
top layers contribute more to modeling low-frequency global information, we
further introduce a frequency ramp structure, i.e., gradually decreasing the
dimensions fed to the high-frequency mixer and increasing those fed to the
low-frequency mixer, which effectively trades off high- and low-frequency
components across layers. We benchmark the iFormer on a series of vision tasks
and show that it achieves impressive performance on image classification, COCO
detection, and ADE20K segmentation. For example, our iFormer-S achieves 83.4%
top-1 accuracy on ImageNet-1K, 3.6% higher than DeiT-S, and even slightly
better than the much larger Swin-B (83.3%) with only 1/4 of the parameters and
1/3 of the FLOPs. Code and models will be released at
https://github.com/sail-sg/iFormer.
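To make the mixer concrete, below is a minimal PyTorch sketch of the channel-splitting Inception mixer and the frequency ramp described in the abstract. The split ratios, kernel sizes, and fusion step are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class InceptionMixerSketch(nn.Module):
    """Splits channels into high-frequency (convolution, max-pooling) and
    low-frequency (self-attention) branches, then fuses the outputs."""

    def __init__(self, dim: int, num_heads: int = 4, high_ratio: float = 0.5):
        super().__init__()
        dim_high = int(dim * high_ratio)       # channels for high-freq mixers
        dim_conv = dim_high // 2               # conv branch
        dim_pool = dim_high - dim_conv         # max-pool branch
        dim_low = dim - dim_high               # self-attention branch
        self.splits = [dim_conv, dim_pool, dim_low]

        # High-frequency path 1: depthwise + pointwise conv for local detail.
        self.conv_branch = nn.Sequential(
            nn.Conv2d(dim_conv, dim_conv, 3, padding=1, groups=dim_conv),
            nn.Conv2d(dim_conv, dim_conv, 1),
        )
        # High-frequency path 2: max-pooling emphasizes sharp local responses.
        self.pool_branch = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(dim_pool, dim_pool, 1),
        )
        # Low-frequency path: global self-attention over spatial positions.
        self.attn = nn.MultiheadAttention(dim_low, num_heads, batch_first=True)
        # Fuse the concatenated branch outputs with a pointwise convolution.
        self.fuse = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        x_conv, x_pool, x_low = torch.split(x, self.splits, dim=1)
        y_conv = self.conv_branch(x_conv)
        y_pool = self.pool_branch(x_pool)
        tokens = x_low.flatten(2).transpose(1, 2)   # (B, H*W, C_low)
        y_low, _ = self.attn(tokens, tokens, tokens)
        y_low = y_low.transpose(1, 2).reshape(b, self.splits[2], h, w)
        return self.fuse(torch.cat([y_conv, y_pool, y_low], dim=1))


# Frequency ramp (illustrative ratios): early stages devote more channels to
# the high-frequency mixers, later stages more to self-attention.
ramp = [0.75, 0.5, 0.25, 0.125]
mixers = [InceptionMixerSketch(dim=64, high_ratio=r) for r in ramp]

x = torch.randn(2, 64, 14, 14)
assert mixers[0](x).shape == x.shape   # each mixer preserves (B, C, H, W)
```

The key design point is that the convolution and max-pooling branches operate only on their channel slices, so adding them costs far less than widening attention; the ramp then shifts capacity from the local branches toward attention as depth grows.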
Related papers
- Frequency-aware Feature Fusion for Dense Image Prediction [99.85757278772262]
We propose Frequency-Aware Feature Fusion (FreqFusion) for dense image prediction tasks.
FreqFusion integrates an Adaptive Low-Pass Filter (ALPF) generator, an offset generator, and an Adaptive High-Pass Filter (AHPF) generator.
Comprehensive visualization and quantitative analysis demonstrate that FreqFusion effectively improves feature consistency and sharpens object boundaries.
arXiv Detail & Related papers (2024-08-23T07:30:34Z)
- ML-CrAIST: Multi-scale Low-high Frequency Information-based Cross black Attention with Image Super-resolving Transformer [3.686808512438363]
This work proposes a transformer-based super-resolution architecture called ML-CrAIST.
We apply spatial and channel self-attention, concurrently modeling pixel interactions along both the spatial and channel dimensions.
We devise a cross-attention block for super-resolution that explores the correlations between low- and high-frequency information.
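For intuition, here is a generic sketch of spatial versus channel self-attention using PyTorch's built-in scaled dot-product attention (PyTorch 2.0+); it illustrates the two interaction directions named in the summary, not ML-CrAIST's actual blocks.

```python
import torch
import torch.nn.functional as F


def spatial_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (B, N, C). Spatial positions (tokens) attend to each other."""
    return F.scaled_dot_product_attention(x, x, x)


def channel_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (B, N, C). Transpose so channels attend to each other."""
    xt = x.transpose(1, 2)                            # (B, C, N)
    return F.scaled_dot_product_attention(xt, xt, xt).transpose(1, 2)


x = torch.randn(2, 196, 64)                           # 14x14 tokens, 64 channels
y = spatial_attention(x) + channel_attention(x)       # combine both interactions
```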
arXiv Detail & Related papers (2024-08-19T12:23:15Z)
- Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators [83.48423407316713]
We present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately.
Our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail.
Our method achieves a state-of-the-art FID score of 2.01 when integrated with the recent work SiT.
arXiv Detail & Related papers (2024-08-11T07:01:39Z)
- MCMS: Multi-Category Information and Multi-Scale Stripe Attention for Blind Motion Deblurring [14.874224120737438]
A blind motion deblurring network (MCMS) based on multi-category information and a multi-scale stripe attention mechanism is proposed.
The model effectively improves motion deblurring by fusing the edge information of the high-frequency component and the structural information of the low-frequency component.
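As a loose illustration of separating the high-frequency (edge) and low-frequency (structure) components mentioned in the summary, the sketch below splits an image with a fixed FFT mask; the cutoff radius is an assumption, and MCMS's own decomposition may differ.

```python
import torch


def frequency_split(img: torch.Tensor, radius: int = 8):
    """img: (B, C, H, W). Returns (low_freq, high_freq) components."""
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    _, _, h, w = img.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = (((yy - h // 2) ** 2 + (xx - w // 2) ** 2).float()).sqrt()
    mask = (dist <= radius).to(spec.dtype)        # keep a low-frequency disk
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
    high = img - low                              # residual carries the edges
    return low, high


img = torch.randn(1, 3, 64, 64)
low, high = frequency_split(img)
assert torch.allclose(low + high, img, atol=1e-4)  # lossless decomposition
```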
arXiv Detail & Related papers (2024-05-02T08:25:52Z)
- Spiking Wavelet Transformer [1.8712213089437697]
Spiking neural networks (SNNs) offer an energy-efficient alternative to conventional deep learning.
Transformers with SNNs have shown promise for accuracy, but struggle to learn high-frequency patterns.
We propose the Spiking Wavelet Transformer (SWformer), an attention-free architecture that effectively learns comprehensive spatial-frequency features in a spike-driven manner.
arXiv Detail & Related papers (2024-03-17T08:41:48Z)
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon that the activation maps of machine learning models with Transformer architectures are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
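The observation is easy to reproduce in miniature: the sketch below measures the fraction of zero post-ReLU activations in a toy MLP block (a hypothetical stand-in for a Transformer feed-forward layer), which is what makes the FLOP savings possible.

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 256))
tokens = torch.randn(32, 196, 256)               # (batch, tokens, dim)

hidden = mlp[1](mlp[0](tokens))                  # post-ReLU activations
sparsity = (hidden == 0).float().mean().item()
print(f"fraction of zero activations: {sparsity:.2%}")
# Every zero entry contributes nothing to the second Linear layer, so a
# sparsity-aware kernel could skip those multiply-accumulates entirely.
```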
arXiv Detail & Related papers (2022-10-12T15:25:19Z)
- Learning Spatial-Frequency Transformer for Visual Object Tracking [15.750739748843744]
Recent trackers adopt the Transformer to combine or replace the widely used ResNet as their new backbone network.
We believe these operations ignore the spatial prior of the target object which may lead to sub-optimal results.
We propose a unified Spatial-Frequency Transformer that models the Gaussian spatial Prior and High-frequency emphasis Attention (GPHA) simultaneously.
arXiv Detail & Related papers (2022-08-18T13:46:12Z)
- FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization [73.41395947275473]
We propose a novel frequency-aware architecture, in which the domain-specific features are filtered out in the transformed frequency domain.
Experiments on three benchmarks demonstrate significant performance gains, outperforming state-of-the-art methods by margins of 3%, 4%, and 9%, respectively.
arXiv Detail & Related papers (2022-03-24T07:26:29Z)
- Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
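The core operation described here is frequency-domain mixing: FFT the token grid, multiply by a learnable filter, and inverse-FFT, giving global mixing in O(HW log HW). A minimal sketch under assumed shapes and initialization:

```python
import torch
import torch.nn as nn


class GlobalFilterSketch(nn.Module):
    def __init__(self, dim: int, h: int, w: int):
        super().__init__()
        # One learnable complex weight per (frequency, channel); rfft2 keeps
        # only w // 2 + 1 columns because the input is real-valued.
        self.filter = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) grid of tokens.
        spec = torch.fft.rfft2(x, dim=(1, 2), norm="ortho")
        spec = spec * torch.view_as_complex(self.filter)
        return torch.fft.irfft2(spec, s=x.shape[1:3], dim=(1, 2), norm="ortho")


layer = GlobalFilterSketch(dim=64, h=14, w=14)
y = layer(torch.randn(2, 14, 14, 64))   # global spatial mixing, no attention
```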
arXiv Detail & Related papers (2021-07-01T17:58:16Z)