FFT-based Dynamic Token Mixer for Vision
- URL: http://arxiv.org/abs/2303.03932v2
- Date: Sun, 17 Dec 2023 16:53:44 GMT
- Title: FFT-based Dynamic Token Mixer for Vision
- Authors: Yuki Tatsunami, Masato Taki
- Abstract summary: We propose a novel token-mixer called Dynamic Filter and novel image recognition models, DFFormer and CDFFormer.
Our results indicate that Dynamic Filter is one of the token-mixer options that should be seriously considered.
- Score: 5.439020425819001
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-head self-attention (MHSA)-equipped models have achieved notable
performance in computer vision. Their computational complexity is proportional
to the square of the number of pixels in the input feature maps, resulting in slow
processing, especially when dealing with high-resolution images. New types of
token-mixer have been proposed as alternatives to MHSA to circumvent this problem:
an FFT-based token-mixer involves global operations similar to MHSA but with
lower computational complexity. However, despite its attractive properties, the
FFT-based token-mixer has not been carefully examined in terms of its
compatibility with the rapidly evolving MetaFormer architecture. Here, we
propose a novel token-mixer called Dynamic Filter and novel image recognition
models, DFFormer and CDFFormer, to close the gaps above. The results of image
classification and downstream tasks, analysis, and visualization show that our
models are effective. Notably, their throughput and memory efficiency when
dealing with high-resolution image recognition are remarkable. Our results
indicate that Dynamic Filter is one of the token-mixer options that should be
seriously considered. The code is available at
https://github.com/okojoalg/dfformer
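To make the idea concrete, below is a minimal PyTorch sketch of an FFT-based dynamic token mixer written against the abstract alone. The layer shapes, the global pooling, and the small MLP that predicts a per-sample frequency-domain filter are illustrative assumptions, not the authors' DFFormer implementation (see the official repository above for that).
```python
# Sketch of an FFT-based dynamic token mixer (assumed design, not DFFormer's code).
import torch
import torch.nn as nn


class DynamicFilterSketch(nn.Module):
    def __init__(self, dim: int, height: int, width: int):
        super().__init__()
        self.height, self.width = height, width
        w_half = width // 2 + 1  # rfft2 keeps only non-negative frequencies on the last axis
        # Small MLP that predicts a per-sample frequency-domain filter from
        # globally pooled features: this is the "dynamic" part (assumption).
        self.to_filter = nn.Sequential(
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, 2 * height * w_half),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map with C == dim, H == height, W == width
        b, c, h, w = x.shape
        ctx = x.mean(dim=(2, 3))                      # global context, (B, C)
        filt = self.to_filter(ctx)                    # (B, 2 * H * (W//2 + 1))
        filt = filt.view(b, 1, h, w // 2 + 1, 2)
        filt = torch.view_as_complex(filt)            # (B, 1, H, W//2 + 1), complex
        x_freq = torch.fft.rfft2(x, dim=(2, 3), norm="ortho")
        x_freq = x_freq * filt                        # global mixing in the frequency domain
        return torch.fft.irfft2(x_freq, s=(h, w), dim=(2, 3), norm="ortho")
```
In a MetaFormer-style block, such a mixer would take the place of the attention sub-layer, with normalization, channel MLP, and residual connections left unchanged.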
Related papers
- Mixing Histopathology Prototypes into Robust Slide-Level Representations
for Cancer Subtyping [19.577541771516124]
Whole-slide image analysis via computational pathology often relies on processing tessellated gigapixel images with only slide-level labels available.
Applying multiple instance learning-based methods or transformer models is computationally expensive because, for each image, all instances have to be processed simultaneously.
The Mixer is an under-explored alternative model to common vision transformers, especially for large-scale datasets.
arXiv Detail & Related papers (2023-10-19T14:15:20Z)
- Mutual-Guided Dynamic Network for Image Fusion [51.615598671899335]
We propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs.
Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks.
arXiv Detail & Related papers (2023-08-24T03:50:37Z)
- Adaptive Frequency Filters As Efficient Global Token Mixers [100.27957692579892]
We show that adaptive frequency filters can serve as efficient global token mixers.
We take AFF token mixers as primary neural operators to build a lightweight neural network, dubbed AFFNet.
arXiv Detail & Related papers (2023-07-26T07:42:28Z)
- T-former: An Efficient Transformer for Image Inpainting [50.43302925662507]
A class of attention-based network architectures, called transformers, has shown significant performance in natural language processing.
In this paper, we design a novel attention mechanism whose cost is linearly related to the resolution, derived via a Taylor expansion, and based on this attention, a network called $T$-former is designed for image inpainting.
Experiments on several benchmark datasets demonstrate that our proposed method achieves state-of-the-art accuracy while maintaining a relatively low number of parameters and computational complexity.
arXiv Detail & Related papers (2023-05-12T04:10:42Z)
- Spatially-Adaptive Feature Modulation for Efficient Image Super-Resolution [90.16462805389943]
We develop a spatially-adaptive feature modulation (SAFM) mechanism upon a vision transformer (ViT)-like block.
The proposed method is $3\times$ smaller than state-of-the-art efficient SR methods.
arXiv Detail & Related papers (2023-02-27T14:19:31Z)
- Efficient Context Integration through Factorized Pyramidal Learning for Ultra-Lightweight Semantic Segmentation [1.0499611180329804]
We propose a novel Factorized Pyramidal Learning (FPL) module to aggregate rich contextual information in an efficient manner.
We decompose the spatial pyramid into two stages, which enables simple and efficient feature fusion within the module and resolves the notorious checkerboard effect.
Based on the FPL module and FIR unit, we propose an ultra-lightweight real-time network, called FPLNet, which achieves state-of-the-art accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-02-23T05:34:51Z)
- UHD Image Deblurring via Multi-scale Cubic-Mixer [12.402054374952485]
Transformer-based algorithms are making a splash in the domain of image deblurring.
These algorithms depend on the self-attention mechanism with a CNN stem to model long-range dependencies between tokens.
arXiv Detail & Related papers (2022-06-08T05:04:43Z)
- WaveMix: Resource-efficient Token Mixing for Images [2.7188347260210466]
We present WaveMix as an alternative neural architecture that uses a multi-scale 2D discrete wavelet transform (DWT) for spatial token mixing.
WaveMix has achieved state-of-the-art (SOTA) results on the EMNIST Byclass and EMNIST Balanced datasets.
arXiv Detail & Related papers (2022-03-07T20:15:17Z)
- Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain.
AFNO is based on a principled foundation of operator learning.
It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
arXiv Detail & Related papers (2021-11-24T05:44:31Z)
- Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
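For contrast with the dynamic variant sketched after the abstract, here is a minimal sketch of static frequency-domain token mixing in the spirit of the Global Filter Networks and Adaptive Fourier Neural Operators entries above. The parameter shapes and the real/imaginary parameterization are illustrative assumptions, not either paper's exact design.
```python
# Sketch of a static, learnable frequency-domain filter (assumed design).
import torch
import torch.nn as nn


class GlobalFilterSketch(nn.Module):
    def __init__(self, dim: int, height: int, width: int):
        super().__init__()
        # One learnable complex filter per channel and frequency bin,
        # stored as separate real and imaginary parts.
        self.weight = nn.Parameter(torch.randn(dim, height, width // 2 + 1, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); FFT + element-wise filtering + inverse FFT gives a
        # global receptive field at roughly O(HW log HW) cost.
        b, c, h, w = x.shape
        x_freq = torch.fft.rfft2(x, dim=(2, 3), norm="ortho")
        x_freq = x_freq * torch.view_as_complex(self.weight)
        return torch.fft.irfft2(x_freq, s=(h, w), dim=(2, 3), norm="ortho")
```
Because the filter is applied element-wise after a 2D FFT, the mixing is global while the complexity stays log-linear in the number of pixels, which is the property the Dynamic Filter mixer in the main paper builds on.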