Factorization Vision Transformer: Modeling Long Range Dependency with
Local Window Cost
- URL: http://arxiv.org/abs/2312.08614v1
- Date: Thu, 14 Dec 2023 02:38:12 GMT
- Title: Factorization Vision Transformer: Modeling Long Range Dependency with
Local Window Cost
- Authors: Haolin Qin, Daquan Zhou, Tingfa Xu, Ziyang Bian, Jianan Li
- Abstract summary: We propose a factorization self-attention mechanism (FaSA) that combines the low cost of local windows with long-range dependency modeling capability.
FaViT achieves high performance and robustness, with computational complexity linear in input image spatial resolution.
Our FaViT-B2 significantly improves classification accuracy by 1% and robustness by 7%, while reducing model parameters by 14%.
- Score: 25.67071603343174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have astounding representational power, but their
computation is typically quadratic in image resolution. The prevailing Swin
transformer reduces computational cost through a local window strategy.
However, this strategy inevitably causes two drawbacks: (1) local
window-based self-attention hinders global dependency modeling; (2) recent
studies point out that local windows impair robustness. To overcome these
challenges, we pursue a better trade-off between computational cost and
performance. Accordingly, we propose a novel factorization self-attention
mechanism (FaSA) that combines the low cost of local windows with long-range
dependency modeling capability. By factorizing the conventional attention
matrix into sparse sub-attention matrices, FaSA captures long-range
dependencies while aggregating mixed-grained information at a computational
cost equivalent to that of local window-based self-attention. Leveraging
FaSA, we present the factorization vision transformer (FaViT) with a
hierarchical structure. FaViT achieves high performance and robustness, with
computational complexity linear in input image spatial resolution. Extensive
experiments demonstrate FaViT's strong performance in classification and
downstream tasks. It also exhibits strong robustness to corrupted and biased
data, making it well suited to practical applications. Compared to the
baseline model Swin-T, our FaViT-B2 significantly improves classification
accuracy by 1% and robustness by 7%, while reducing model parameters by 14%.
Our code will soon be publicly available at
https://github.com/q2479036243/FaViT.
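The paper's exact FaSA design is not reproduced here, but the complexity argument is easy to make concrete: for N tokens of dimension d, dense self-attention costs O(N^2 d), while a sparse sub-attention over groups of fixed size w costs O(N w d), i.e., linear in N. Below is a minimal NumPy sketch of the general idea of factorizing attention into sparse sub-attentions, assuming one contiguous-window (fine-grained) branch and one dilated (long-range) branch; the function names, the partition choices, and the 50/50 fusion are illustrative assumptions, not FaViT's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sub_attention(q, k, v, groups):
    """Sparse sub-attention: each token attends only within its group."""
    d = q.shape[-1]
    out = np.zeros_like(v)
    for g in groups:                         # each group costs O(w^2 d)
        scores = q[g] @ k[g].T / np.sqrt(d)
        out[g] = softmax(scores) @ v[g]
    return out

def factorized_attention(q, k, v, window=4):
    """Sum of two sparse sub-attentions at local-window cost (sketch)."""
    n = q.shape[0]
    assert n % window == 0
    # Fine-grained branch: contiguous local windows (short-range detail).
    local = np.arange(n).reshape(n // window, window)
    # Coarse-grained branch: dilated windows with stride n // window,
    # so each token also reaches distant positions (long-range mixing).
    dilated = np.arange(n).reshape(window, n // window).T
    return 0.5 * (sub_attention(q, k, v, local) +
                  sub_attention(q, k, v, dilated))

# Toy usage: 16 tokens, 8 channels.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
print(factorized_attention(q, k, v, window=4).shape)  # (16, 8)
```

Each token attends to 2·window positions instead of all N, so the cost grows linearly with the token count, yet the dilated branch still propagates information across the whole sequence, mirroring the "local window cost, long-range dependency" trade-off described above.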
Related papers
- Generalized Nested Latent Variable Models for Lossy Coding applied to Wind Turbine Scenarios [14.48369551534582]
A learning-based approach seeks to optimize the trade-off between compression rate and reconstructed image quality.
A successful technique introduces a deep hyperprior that operates within a 2-level nested latent variable model.
This paper extends this concept by designing a generalized L-level nested generative model with a Markov chain structure.
arXiv Detail & Related papers (2024-06-10T11:00:26Z)
- Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures [99.20299078655376]
This paper introduces Vision-RWKV, a model adapted from the RWKV model used in the NLP field.
Our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities.
Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage.
arXiv Detail & Related papers (2024-03-04T18:46:20Z)
- Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [63.98380888730723]
We introduce the Convolutional Transformer layer (ConvFormer) and the ConvFormer-based Super-Resolution network (CFSR).
CFSR efficiently models long-range dependencies and extensive receptive fields at a slight computational cost.
It achieves a 0.39 dB gain on the Urban100 dataset for the x2 SR task while using 26% fewer parameters and 31% fewer FLOPs.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
- Robust representations of oil wells' intervals via sparse attention mechanism [2.604557228169423]
We introduce a class of efficient Transformers named Regularized Transformers (Reguformers).
Our experiments focus on oil & gas data, namely well logs.
To evaluate our models on such problems, we use an industry-scale open dataset consisting of well logs from more than 20 wells.
arXiv Detail & Related papers (2022-12-29T09:56:33Z)
- Magic ELF: Image Deraining Meets Association Learning and Transformer [63.761812092934576]
This paper aims to unify CNN and Transformer to take advantage of their learning merits for image deraining.
A novel multi-input attention module (MAM) is proposed to associate rain removal and background recovery.
Our proposed method (dubbed ELF) outperforms the state-of-the-art approach (MPRNet) by 0.25 dB on average.
arXiv Detail & Related papers (2022-07-21T12:50:54Z)
- Vicinity Vision Transformer [53.43198716947792]
We present Vicinity Attention, which introduces a locality bias into vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Learned Queries for Efficient Local Attention [11.123272845092611]
The self-attention mechanism in vision transformers suffers from high latency and inefficient memory utilization.
We propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner.
We show improvements in speed and memory complexity while achieving comparable accuracy with state-of-the-art models.
arXiv Detail & Related papers (2021-12-21T18:52:33Z)
- Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
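The last entry, Global Filter Networks (GFNet), achieves global mixing not through sparse attention but by filtering in the frequency domain. As a minimal, hedged NumPy sketch of that idea (the layer name is ours, and a random filter stands in for the learned one):

```python
import numpy as np

def global_filter_layer(x, w):
    """Frequency-domain global token mixing (GFNet-style sketch).

    x: real feature map of shape (H, W, C).
    w: complex filter of shape (H, W // 2 + 1, C); learned in the real model.
    """
    X = np.fft.rfft2(x, axes=(0, 1))                         # 2D FFT over H, W
    return np.fft.irfft2(X * w, s=x.shape[:2], axes=(0, 1))  # filter and invert

# Toy usage with a random filter standing in for learned weights.
rng = np.random.default_rng(0)
H, W, C = 8, 8, 4
x = rng.standard_normal((H, W, C))
w = (rng.standard_normal((H, W // 2 + 1, C))
     + 1j * rng.standard_normal((H, W // 2 + 1, C)))
print(global_filter_layer(x, w).shape)  # (8, 8, 4)
```

Element-wise multiplication in Fourier space is a global circular convolution in space, so every output position depends on every input position at O(HW log HW) cost, which is the log-linear complexity the summary refers to.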