Demystify Transformers & Convolutions in Modern Image Deep Networks
- URL: http://arxiv.org/abs/2211.05781v2
- Date: Fri, 1 Dec 2023 08:00:51 GMT
- Title: Demystify Transformers & Convolutions in Modern Image Deep Networks
- Authors: Xiaowei Hu, Min Shi, Weiyun Wang, Sitong Wu, Linjie Xing, Wenhai Wang,
Xizhou Zhu, Lewei Lu, Jie Zhou, Xiaogang Wang, Yu Qiao, Jifeng Dai
- Abstract summary: This paper aims to identify the real gains of popular convolution and attention operators through a detailed study.
We find that the key difference among these feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach.
Our experiments on various tasks and an analysis of inductive bias show a significant performance boost due to advanced network-level and block-level designs.
- Score: 82.32018252867277
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers have gained popularity recently, leading to the
development of new vision backbones with improved features and consistent
performance gains. However, these advancements are not solely attributable to
novel feature transformation designs; certain benefits also arise from advanced
network-level and block-level architectures. This paper aims to identify the
real gains of popular convolution and attention operators through a detailed
study. We find that the key difference among these feature transformation
modules, such as attention or convolution, lies in their spatial feature
aggregation approach, known as the "spatial token mixer" (STM). To facilitate
an impartial comparison, we introduce a unified architecture to neutralize the
impact of divergent network-level and block-level designs. Subsequently,
various STMs are integrated into this unified framework for comprehensive
comparative analysis. Our experiments on various tasks and an analysis of
inductive bias show a significant performance boost due to advanced
network-level and block-level designs, but performance differences persist
among different STMs. Our detailed analysis also reveals various findings about
different STMs, such as effective receptive fields and invariance tests. All
models and codes used in this study are publicly available at
https://github.com/OpenGVLab/STM-Evaluation.
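As a rough illustration of the unified framework described in the abstract, the sketch below keeps the block-level design fixed (pre-norm residual branches plus a channel MLP) and swaps only the spatial token mixer (STM) between a convolution-style and an attention-style operator. This is a minimal PyTorch sketch under assumed module names and hyperparameters, not the authors' released implementation; see the linked repository for the actual code.

```python
# Minimal sketch of the "unified block with pluggable STM" idea.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class DepthwiseConvSTM(nn.Module):
    """Convolution-style STM: aggregates spatial context with a depthwise conv."""
    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):          # x: (B, C, H, W)
        return self.dwconv(x)


class SelfAttentionSTM(nn.Module):
    """Attention-style STM: aggregates spatial context with global self-attention."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):          # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(b, c, h, w)


class UnifiedBlock(nn.Module):
    """Shared block-level design; only `stm` differs between compared models."""
    def __init__(self, dim: int, stm: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)   # single-group GroupNorm as a channels-first norm
        self.stm = stm
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(            # channel-mixing MLP via 1x1 convs
            nn.Conv2d(dim, dim * mlp_ratio, 1),
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
        )

    def forward(self, x):
        x = x + self.stm(self.norm1(x))      # spatial token mixing
        x = x + self.mlp(self.norm2(x))      # channel mixing
        return x


if __name__ == "__main__":
    x = torch.randn(2, 64, 14, 14)
    conv_block = UnifiedBlock(64, DepthwiseConvSTM(64))
    attn_block = UnifiedBlock(64, SelfAttentionSTM(64))
    assert conv_block(x).shape == attn_block(x).shape == x.shape
```

Because the surrounding block is identical, any accuracy gap between the two instantiations can be attributed to the STM itself rather than to network-level or block-level design choices.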
Related papers
- Investigation of Hierarchical Spectral Vision Transformer Architecture for Classification of Hyperspectral Imagery [7.839253919389809]
The theoretical justification for vision Transformers outperforming CNN architectures in HSI classification remains an open question.
A unified hierarchical spectral vision Transformer architecture, specifically tailored for HSI classification, is investigated.
It is concluded that the unique strength of vision Transformers can be attributed to their overarching architecture.
arXiv Detail & Related papers (2024-09-14T00:53:13Z)
- Aligning in a Compact Space: Contrastive Knowledge Distillation between Heterogeneous Architectures [4.119589507611071]
We propose a Low-Frequency Components-based Contrastive Knowledge Distillation (LFCC) framework that significantly enhances the performance of feature-based distillation.
Specifically, we design a set of multi-scale low-pass filters to extract the low-frequency components of intermediate features from both the teacher and student models.
We show that LFCC achieves superior performance on the challenging benchmarks of ImageNet-1K and CIFAR-100.
arXiv Detail & Related papers (2024-05-28T18:44:42Z)
- Improving Stain Invariance of CNNs for Segmentation by Fusing Channel Attention and Domain-Adversarial Training [5.501810688265425]
Variability in staining protocols, such as different slide preparation techniques, chemicals, and scanner configurations, can result in a diverse set of whole slide images (WSIs).
This distribution shift can negatively impact the performance of deep learning models on unseen samples.
We propose a method for improving the generalizability of convolutional neural networks (CNNs) to stain changes in a single-source setting for semantic segmentation.
arXiv Detail & Related papers (2023-04-22T16:54:37Z)
- A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
Self-attention modules (SAMs) produce strongly correlated attention maps across different layers.
Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module.
Our simple yet effective DIA can consistently enhance various network backbones.
arXiv Detail & Related papers (2022-10-27T13:24:08Z)
- Weak Augmentation Guided Relational Self-Supervised Learning [80.0680103295137]
We introduce a novel relational self-supervised learning (ReSSL) framework that learns representations by modeling the relationship between different instances.
Our proposed method employs a sharpened distribution of pairwise similarities among different instances as the relation metric.
Experimental results show that our proposed ReSSL substantially outperforms the state-of-the-art methods across different network architectures.
arXiv Detail & Related papers (2022-03-16T16:14:19Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Exploring Complementary Strengths of Invariant and Equivariant Representations for Few-Shot Learning [96.75889543560497]
In many real-world problems, collecting a large number of labeled samples is infeasible.
Few-shot learning is the dominant approach to address this issue, where the objective is to quickly adapt to novel categories in presence of a limited number of samples.
We propose a novel training mechanism that simultaneously enforces equivariance and invariance to a general set of geometric transformations.
arXiv Detail & Related papers (2021-03-01T21:14:33Z)