Inception Transformer
- URL: http://arxiv.org/abs/2205.12956v2
- Date: Thu, 26 May 2022 17:18:32 GMT
- Title: Inception Transformer
- Authors: Chenyang Si, Weihao Yu, Pan Zhou, Yichen Zhou, Xinchao Wang, Shuicheng Yan
- Abstract summary: Inception Transformer, or iFormer, learns comprehensive features with both high- and low-frequency information in visual data.
We benchmark the iFormer on a series of vision tasks, and showcase that it achieves impressive performance on image classification, COCO detection and ADE20K segmentation.
- Score: 151.939077819196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies show that the Transformer has a strong capability
for building long-range dependencies, yet it is incompetent at capturing the
high frequencies that predominantly convey local information. To tackle this
issue, we present a novel and general-purpose Inception Transformer, or
iFormer for short, that effectively learns comprehensive features with both
high- and low-frequency information in visual data. Specifically, we design an
Inception mixer to explicitly graft the advantages of convolution and
max-pooling for capturing high-frequency information onto Transformers. Unlike
recent hybrid frameworks, the Inception mixer gains efficiency through a
channel splitting mechanism that adopts parallel convolution/max-pooling paths
and a self-attention path as high- and low-frequency mixers, while retaining
the flexibility to model discriminative information scattered across a wide
frequency range. Considering that bottom layers play a greater role in
capturing high-frequency details while top layers contribute more to modeling
low-frequency global information, we further introduce a frequency ramp
structure, i.e., gradually decreasing the dimensions fed to the high-frequency
mixer and increasing those fed to the low-frequency mixer, which effectively
trades off high- and low-frequency components across layers. We benchmark the
iFormer on a series of vision tasks and show that it achieves impressive
performance on image classification, COCO detection and ADE20K segmentation.
For example, our iFormer-S reaches a top-1 accuracy of 83.4% on ImageNet-1K,
3.6% higher than DeiT-S, and even slightly better than the much bigger Swin-B
(83.3%) with only 1/4 of the parameters and 1/3 of the FLOPs. Code and models
will be released at https://github.com/sail-sg/iFormer.
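To make the channel-splitting design above concrete, the following is a minimal PyTorch sketch of an Inception mixer, not the authors' released implementation (see the repository above for that). The module name InceptionMixer, the 3x3 kernel sizes, and the even split between the convolution and max-pooling branches are illustrative assumptions.

```python
# Minimal sketch of an Inception mixer: channels are split across parallel
# high-frequency (convolution, max-pooling) and low-frequency (self-attention)
# paths, then fused. Illustrative only; kernel sizes and splits are assumptions.
import torch
import torch.nn as nn


class InceptionMixer(nn.Module):
    def __init__(self, dim, low_freq_ratio=0.5, num_heads=4):
        super().__init__()
        self.dim_low = int(dim * low_freq_ratio)   # channels for self-attention
        dim_high = dim - self.dim_low              # channels for conv/max-pool
        self.dim_conv = dim_high // 2
        self.dim_pool = dim_high - self.dim_conv

        # High-frequency mixers: a depthwise-conv branch and a max-pool branch.
        self.conv_branch = nn.Sequential(
            nn.Conv2d(self.dim_conv, self.dim_conv, 1),
            nn.Conv2d(self.dim_conv, self.dim_conv, 3, padding=1,
                      groups=self.dim_conv),
        )
        self.pool_branch = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(self.dim_pool, self.dim_pool, 1),
        )
        # Low-frequency mixer: global self-attention over the token sequence
        # (self.dim_low must be divisible by num_heads).
        self.attn = nn.MultiheadAttention(self.dim_low, num_heads,
                                          batch_first=True)
        self.proj = nn.Conv2d(dim, dim, 1)         # fuse the three paths

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        x_conv, x_pool, x_attn = torch.split(
            x, [self.dim_conv, self.dim_pool, self.dim_low], dim=1)

        hi_conv = self.conv_branch(x_conv)          # local, high-frequency
        hi_pool = self.pool_branch(x_pool)          # local, high-frequency

        tokens = x_attn.flatten(2).transpose(1, 2)  # (B, H*W, dim_low)
        lo, _ = self.attn(tokens, tokens, tokens)   # global, low-frequency
        lo = lo.transpose(1, 2).reshape(b, self.dim_low, h, w)

        return self.proj(torch.cat([hi_conv, hi_pool, lo], dim=1))
```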
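The frequency ramp structure can likewise be sketched as a simple per-layer schedule. The linear form and the endpoint ratios below are assumptions for illustration, not the paper's exact settings.

```python
def frequency_ramp(depth, start=0.25, end=0.75):
    """Share of channels given to the low-frequency (attention) path at each
    layer: bottom layers favor the conv/max-pool mixers, top layers favor
    self-attention. The linear schedule and endpoints are assumed, not taken
    from the paper."""
    return [start + (end - start) * i / max(depth - 1, 1)
            for i in range(depth)]
```

A layer at depth i would then be built as InceptionMixer(dim, low_freq_ratio=frequency_ramp(depth)[i]), so early layers devote most channels to the high-frequency paths and later layers shift capacity toward self-attention.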
Related papers
- MCMS: Multi-Category Information and Multi-Scale Stripe Attention for Blind Motion Deblurring [14.874224120737438]
A blind motion deblurring network (MCMS) based on multi-category information and a multi-scale stripe attention mechanism is proposed.
The model effectively improves motion deblurring by fusing the edge information of the high-frequency component and the structural information of the low-frequency component.
arXiv Detail & Related papers (2024-05-02T08:25:52Z)
- Spiking Wavelet Transformer [1.8712213089437697]
Spiking neural networks (SNNs) offer an energy-efficient alternative to conventional deep learning by mimicking the event-driven processing of the brain.
Such models, however, struggle to capture high-frequency patterns like moving edges and pixel-level brightness changes due to their reliance on global self-attention operations.
We propose the Spiking Wavelet Transformer (SWformer), an attention-free architecture that effectively learns comprehensive spatial-frequency features in a spike-driven manner.
arXiv Detail & Related papers (2024-03-17T08:41:48Z)
- Frequency-Adaptive Pan-Sharpening with Mixture of Experts [22.28680499480492]
We propose a novel Frequency Adaptive Mixture of Experts (FAME) learning framework for pan-sharpening.
Our method performs best against other state-of-the-art ones and demonstrates strong generalization ability for real-world scenes.
arXiv Detail & Related papers (2024-01-04T08:58:25Z)
- AligNeRF: High-Fidelity Neural Radiance Fields via Alignment-Aware Training [100.33713282611448]
We conduct the first pilot study on training NeRF with high-resolution data.
We propose the corresponding solutions, including marrying the multilayer perceptron with convolutional layers.
Our approach is nearly free, introducing no obvious training/testing costs.
arXiv Detail & Related papers (2022-11-17T17:22:28Z)
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon that the activation maps of machine learning models with Transformer architectures are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
arXiv Detail & Related papers (2022-10-12T15:25:19Z)
- Learning Spatial-Frequency Transformer for Visual Object Tracking [15.750739748843744]
Recent trackers adopt the Transformer to combine or replace the widely used ResNet as their new backbone network.
We believe these operations ignore the spatial prior of the target object, which may lead to sub-optimal results.
We propose a unified Spatial-Frequency Transformer that models the Gaussian spatial Prior and High-frequency emphasis Attention (GPHA) simultaneously.
arXiv Detail & Related papers (2022-08-18T13:46:12Z)
- FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization [73.41395947275473]
We propose a novel frequency-aware architecture, in which domain-specific features are filtered out in the transformed frequency domain.
Experiments on three benchmarks demonstrate significant performance gains, outperforming state-of-the-art methods by margins of 3%, 4% and 9%, respectively.
arXiv Detail & Related papers (2022-03-24T07:26:29Z)
- Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
- Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
arXiv Detail & Related papers (2021-06-10T17:59:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.