Demystify Transformers & Convolutions in Modern Image Deep Networks
- URL: http://arxiv.org/abs/2211.05781v2
- Date: Fri, 1 Dec 2023 08:00:51 GMT
- Title: Demystify Transformers & Convolutions in Modern Image Deep Networks
- Authors: Xiaowei Hu, Min Shi, Weiyun Wang, Sitong Wu, Linjie Xing, Wenhai Wang,
  Xizhou Zhu, Lewei Lu, Jie Zhou, Xiaogang Wang, Yu Qiao, Jifeng Dai
- Abstract summary: This paper aims to identify the real gains of popular convolution and attention operators through a detailed study.
We find that the key difference among these feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach.
Our experiments on various tasks and an analysis of inductive bias show a significant performance boost due to advanced network-level and block-level designs.
- Score: 82.32018252867277
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers have gained popularity recently, leading to the
development of new vision backbones with improved features and consistent
performance gains. However, these advancements are not solely attributable to
novel feature transformation designs; certain benefits also arise from advanced
network-level and block-level architectures. This paper aims to identify the
real gains of popular convolution and attention operators through a detailed
study. We find that the key difference among these feature transformation
modules, such as attention or convolution, lies in their spatial feature
aggregation approach, known as the "spatial token mixer" (STM). To facilitate
an impartial comparison, we introduce a unified architecture to neutralize the
impact of divergent network-level and block-level designs. Subsequently,
various STMs are integrated into this unified framework for comprehensive
comparative analysis. Our experiments on various tasks and an analysis of
inductive bias show a significant performance boost due to advanced
network-level and block-level designs, but performance differences persist
among different STMs. Our detailed analysis also reveals various findings about
different STMs, such as effective receptive fields and invariance tests. All
models and code used in this study are publicly available at
https://github.com/OpenGVLab/STM-Evaluation.
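To make the STM idea concrete, here is a minimal sketch (our illustration, not the released STM-Evaluation code) of a unified block in which the network- and block-level design is held fixed and only the spatial token mixer is swapped, e.g. a depthwise convolution versus multi-head self-attention. Module names and hyperparameters here are our assumptions.

```python
# Minimal sketch (ours, not the authors' code) of a unified block where
# everything is fixed except the spatial token mixer (STM).
import torch
import torch.nn as nn

class DepthwiseConvSTM(nn.Module):
    """Convolution-style STM: local spatial aggregation."""
    def __init__(self, dim, kernel_size=7):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size,
                            padding=kernel_size // 2, groups=dim)

    def forward(self, x):                  # x: (B, C, H, W)
        return self.dw(x)

class AttentionSTM(nn.Module):
    """Attention-style STM: global spatial aggregation."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)   # (B, HW, C) token sequence
        t, _ = self.attn(t, t, t)
        return t.transpose(1, 2).reshape(b, c, h, w)

class UnifiedBlock(nn.Module):
    """Fixed block design (norm -> STM -> residual -> MLP); only `stm` varies."""
    def __init__(self, dim, stm, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)  # channel-wise LayerNorm equivalent
        self.stm = stm
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1))

    def forward(self, x):
        x = x + self.stm(self.norm1(x))
        return x + self.mlp(self.norm2(x))

x = torch.randn(2, 64, 14, 14)
for stm in (DepthwiseConvSTM(64), AttentionSTM(64)):
    print(UnifiedBlock(64, stm)(x).shape)  # torch.Size([2, 64, 14, 14])
```

Holding the surrounding block fixed in this way is what lets any remaining accuracy gap be attributed to the STM itself rather than to network- or block-level design.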

Related papers
- The Sword of Damocles in ViTs: Computational Redundancy Amplifies Adversarial Transferability [38.32538271219404]
 We investigate the role of computational redundancy in Vision Transformers (ViTs) and its impact on adversarial transferability.
We identify two forms of redundancy, data-level and model-level, that can be harnessed to amplify attack effectiveness.
Building on this insight, we design a suite of techniques, including attention sparsity manipulation, attention head permutation, clean token regularization, ghost MoE diversification, and test-time adversarial training.
 arXiv  Detail & Related papers  (2025-04-15T01:59:47Z)
- MSSFC-Net: Enhancing Building Interpretation with Multi-Scale Spatial-Spectral Feature Collaboration [4.480146005071275]
 Building interpretation from remote sensing imagery primarily involves two fundamental tasks: building extraction and change detection.
We propose a Multi-Scale Spatial-Spectral Feature Cooperative Dual-Task Network (MSSFC-Net) for joint building extraction and change detection in remote sensing images.
 arXiv  Detail & Related papers  (2025-04-01T13:10:23Z)
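As an illustration of the joint-task pattern described above, here is a generic sketch (hypothetical module names, not the MSSFC-Net architecture): a shared encoder feeds a per-date building-extraction head and a change head on the concatenated bi-temporal features.

```python
import torch
import torch.nn as nn

class DualTaskNet(nn.Module):
    """Generic joint building extraction + change detection:
    one shared encoder, two task heads (a simplification, not MSSFC-Net)."""
    def __init__(self, c=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU())
        self.extract_head = nn.Conv2d(c, 1, 1)      # per-date building mask
        self.change_head = nn.Conv2d(2 * c, 1, 1)   # bi-temporal change mask

    def forward(self, t1, t2):
        f1, f2 = self.encoder(t1), self.encoder(t2)  # shared weights
        return (self.extract_head(f1), self.extract_head(f2),
                self.change_head(torch.cat([f1, f2], dim=1)))

net = DualTaskNet()
m1, m2, change = net(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```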
- Investigation of Hierarchical Spectral Vision Transformer Architecture for Classification of Hyperspectral Imagery [7.839253919389809]
 The theoretical justification for vision Transformers outperforming CNN architectures in HSI classification remains an open question.
A unified hierarchical spectral vision Transformer architecture, specifically tailored for HSI classification, is investigated.
It is concluded that the unique strength of vision Transformers can be attributed to their overarching architecture.
 arXiv  Detail & Related papers  (2024-09-14T00:53:13Z)
- Aligning in a Compact Space: Contrastive Knowledge Distillation between Heterogeneous Architectures [4.119589507611071]
 We propose a Low-Frequency Components-based Contrastive Knowledge Distillation (LFCC) framework that significantly enhances the performance of feature-based distillation.
Specifically, we design a set of multi-scale low-pass filters to extract the low-frequency components of intermediate features from both the teacher and student models (sketched below).
We show that LFCC achieves superior performance on the challenging benchmarks of ImageNet-1K and CIFAR-100.
 arXiv  Detail & Related papers  (2024-05-28T18:44:42Z)
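A hypothetical sketch of the low-frequency idea (ours, not the LFCC authors' code): smooth teacher and student feature maps with average-pooling low-pass filters at several scales, then align them with an InfoNCE-style contrastive loss. Scales and temperature are our assumptions.

```python
import torch
import torch.nn.functional as F

def low_freq(feat, scales=(2, 4)):
    """Multi-scale low-pass filtering via blur (downsample + upsample)."""
    outs = []
    for s in scales:
        lp = F.avg_pool2d(feat, s)  # discard high-frequency detail
        outs.append(F.interpolate(lp, size=feat.shape[-2:],
                                  mode="bilinear", align_corners=False))
    return torch.stack(outs).mean(0)

def contrastive_kd_loss(student, teacher, tau=0.1):
    """InfoNCE between low-frequency student/teacher features of a batch."""
    s = F.normalize(low_freq(student).flatten(1), dim=1)  # (B, D)
    t = F.normalize(low_freq(teacher).flatten(1), dim=1)
    logits = s @ t.T / tau            # (B, B) cross-model similarities
    labels = torch.arange(s.size(0))  # positives on the diagonal
    return F.cross_entropy(logits, labels)

loss = contrastive_kd_loss(torch.randn(8, 64, 14, 14),
                           torch.randn(8, 64, 14, 14))
```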
- Understanding Self-attention Mechanism via Dynamical System Perspective [58.024376086269015]
 The self-attention mechanism (SAM) is widely used in various fields of artificial intelligence.
We show that the intrinsic stiffness phenomenon (SP) found in high-precision solutions of ordinary differential equations (ODEs) also widely exists in high-performance neural networks (NNs).
We show that SAM is also a stiffness-aware step size adaptor that can enhance the model's representational ability to measure intrinsic SP.
 arXiv  Detail & Related papers  (2023-08-19T08:17:41Z)
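For context on the claim above: a residual update can be read as one explicit-Euler step of an ODE, and the paper's view is that self-attention makes the step size input-adaptive. In our notation (not the paper's):

$$\dot{x} = f(x) \;\Longrightarrow\; x_{n+1} = x_n + h\,f(x_n), \qquad x_{n+1} = x_n + a(x_n)\odot f(x_n),$$

where the fixed step $h$ is replaced by an attention-derived, input-dependent factor $a(x_n)$. Stiff dynamics (rapidly varying $f$) are exactly the regime where a fixed-step solver must shrink $h$, which is why an adaptive step is claimed to help.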
- Improving Stain Invariance of CNNs for Segmentation by Fusing Channel Attention and Domain-Adversarial Training [5.501810688265425]
 Variability in staining protocols, such as different slide preparation techniques, chemicals, and scanner configurations, can result in a diverse set of whole slide images (WSIs).
This distribution shift can negatively impact the performance of deep learning models on unseen samples.
We propose a method for improving the generalizability of convolutional neural networks (CNNs) to stain changes in a single-source setting for semantic segmentation.
 arXiv  Detail & Related papers  (2023-04-22T16:54:37Z)
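The domain-adversarial ingredient mentioned above is commonly implemented with a gradient reversal layer (in the style of Ganin & Lempitsky); a minimal sketch under that assumption, with toy heads standing in for the paper's segmentation decoder and stain-domain classifier:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None  # flip gradients flowing to the encoder

features = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
seg_head = nn.Linear(16, 2)     # stands in for a segmentation decoder
domain_head = nn.Linear(16, 4)  # predicts the stain/domain label

x = torch.randn(8, 3, 32, 32)
f = features(x)
task_logits = seg_head(f)
domain_logits = domain_head(GradReverse.apply(f, 1.0))  # encoder unlearns stain
```

Training the domain head through the reversed gradient pushes the encoder toward stain-invariant features while the main task loss keeps them discriminative.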
- A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
 Self-attention modules (SAMs) produce strongly correlated attention maps across different layers.
Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module.
Our simple yet effective DIA can consistently enhance various network backbones.
 arXiv  Detail & Related papers  (2022-10-27T13:24:08Z)
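A minimal sketch of the layer-sharing idea, assuming a simplified channel-attention form (the paper's exact DIA design may differ): a single module instance, whose LSTM cell carries state across depth, is applied at every layer.

```python
import torch
import torch.nn as nn

class SharedDIA(nn.Module):
    """One attention module shared by all layers; an LSTM cell carries its
    hidden state across depth (a simplification of the paper's design)."""
    def __init__(self, dim):
        super().__init__()
        self.cell = nn.LSTMCell(dim, dim)
        self.state = None

    def reset(self):
        self.state = None

    def forward(self, x):              # x: (B, C, H, W)
        pooled = x.mean(dim=(2, 3))    # (B, C) global descriptor
        h, c = self.cell(pooled, self.state)
        self.state = (h, c)
        return x * torch.sigmoid(h)[:, :, None, None]  # channel attention

dia = SharedDIA(32)
blocks = nn.ModuleList(nn.Conv2d(32, 32, 3, padding=1) for _ in range(4))
x = torch.randn(2, 32, 8, 8)
dia.reset()
for blk in blocks:  # the SAME dia instance is applied at every layer
    x = x + dia(torch.relu(blk(x)))
```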
- Weak Augmentation Guided Relational Self-Supervised Learning [80.0680103295137]
 We introduce a novel relational self-supervised learning (ReSSL) framework that learns representations by modeling the relationship between different instances.
Our proposed method employs a sharpened distribution of pairwise similarities among different instances as the relation metric (sketched below).
 Experimental results show that our proposed ReSSL substantially outperforms the state-of-the-art methods across different network architectures.
 arXiv  Detail & Related papers  (2022-03-16T16:14:19Z)
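The relation metric above can be sketched as aligning the strongly augmented view's similarity distribution (over a memory queue) with a sharper target distribution from the weakly augmented view; the temperatures and queue size here are our assumptions, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def ressl_loss(z_weak, z_strong, queue, t_teacher=0.04, t_student=0.1):
    """Cross-entropy between pairwise-similarity ("relation") distributions.
    The smaller teacher temperature sharpens the target distribution."""
    zw = F.normalize(z_weak, dim=1)
    zs = F.normalize(z_strong, dim=1)
    q = F.normalize(queue, dim=1)
    p_t = F.softmax(zw @ q.T / t_teacher, dim=1)         # sharpened target
    log_p_s = F.log_softmax(zs @ q.T / t_student, dim=1)
    return -(p_t * log_p_s).sum(1).mean()

loss = ressl_loss(torch.randn(8, 128),    # weak-view embeddings
                  torch.randn(8, 128),    # strong-view embeddings
                  torch.randn(256, 128))  # memory queue
```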
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
 Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
 arXiv  Detail & Related papers  (2022-03-15T06:52:25Z)
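A generic rendering of such a hybrid (our illustration, not the paper's exact network): a convolutional branch for local features, an attention branch for long-range dependencies, fused and then upsampled with PixelShuffle.

```python
import torch
import torch.nn as nn

class HybridSRBlock(nn.Module):
    """Local CNN branch + global attention branch, fused by a 1x1 conv."""
    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.local = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(dim, dim, 3, padding=1))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)          # (B, HW, C) tokens
        g, _ = self.attn(t, t, t)                 # long-range dependencies
        g = g.transpose(1, 2).reshape(b, c, h, w)
        return x + self.fuse(torch.cat([self.local(x), g], dim=1))

head = nn.Conv2d(3, 32, 3, padding=1)
up = nn.Sequential(nn.Conv2d(32, 3 * 4, 3, padding=1), nn.PixelShuffle(2))
sr = up(HybridSRBlock()(head(torch.randn(1, 3, 24, 24))))  # 2x upscaled output
```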
- Vision Transformers are Robust Learners [65.91359312429147]
 We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
 arXiv  Detail & Related papers  (2021-05-17T02:39:22Z)
- Exploring Complementary Strengths of Invariant and Equivariant Representations for Few-Shot Learning [96.75889543560497]
 In many real-world problems, collecting a large number of labeled samples is infeasible.
Few-shot learning is the dominant approach to address this issue, where the objective is to quickly adapt to novel categories in presence of a limited number of samples.
We propose a novel training mechanism that simultaneously enforces equivariance and invariance to a general set of geometric transformations.
 arXiv  Detail & Related papers  (2021-03-01T21:14:33Z)
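A toy rendering of enforcing both properties on a single transformation family (90-degree rotations; the paper uses a more general set of geometric transformations): an invariance term pulls embeddings of transformed views together, while an auxiliary head must recover which transform was applied (equivariance).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 64))
rot_head = nn.Linear(64, 4)  # predicts rotation in {0, 90, 180, 270} degrees

x = torch.randn(8, 3, 32, 32)
k = torch.randint(0, 4, (8,))  # random rotation index per image
xr = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                  for img, r in zip(x, k)])

z, zr = encoder(x), encoder(xr)
inv_loss = 1 - F.cosine_similarity(z, zr).mean()  # invariance term
equi_loss = F.cross_entropy(rot_head(zr), k)      # equivariance term
loss = inv_loss + equi_loss
```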
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.