Demystify Transformers & Convolutions in Modern Image Deep Networks
- URL: http://arxiv.org/abs/2211.05781v2
- Date: Fri, 1 Dec 2023 08:00:51 GMT
- Title: Demystify Transformers & Convolutions in Modern Image Deep Networks
- Authors: Xiaowei Hu, Min Shi, Weiyun Wang, Sitong Wu, Linjie Xing, Wenhai Wang,
  Xizhou Zhu, Lewei Lu, Jie Zhou, Xiaogang Wang, Yu Qiao, Jifeng Dai
- Abstract summary: This paper aims to identify the real gains of popular convolution and attention operators through a detailed study.
We find that the key difference among these feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach.
Our experiments on various tasks and an analysis of inductive bias show a significant performance boost due to advanced network-level and block-level designs.
- Score: 82.32018252867277
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers have gained popularity recently, leading to the
development of new vision backbones with improved features and consistent
performance gains. However, these advancements are not solely attributable to
novel feature transformation designs; certain benefits also arise from advanced
network-level and block-level architectures. This paper aims to identify the
real gains of popular convolution and attention operators through a detailed
study. We find that the key difference among these feature transformation
modules, such as attention or convolution, lies in their spatial feature
aggregation approach, known as the "spatial token mixer" (STM). To facilitate
an impartial comparison, we introduce a unified architecture to neutralize the
impact of divergent network-level and block-level designs. Subsequently,
various STMs are integrated into this unified framework for comprehensive
comparative analysis. Our experiments on various tasks and an analysis of
inductive bias show a significant performance boost due to advanced
network-level and block-level designs, but performance differences persist
among different STMs. Our detailed analysis also reveals various findings about
different STMs, such as effective receptive fields and invariance tests. All
models and code used in this study are publicly available at
https://github.com/OpenGVLab/STM-Evaluation.
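To make the STM idea concrete, here is a minimal sketch (our illustration, not the released STM-Evaluation code) of a unified block in which the network- and block-level design is held fixed and only the spatial token mixer is swapped, e.g. a depthwise convolution versus multi-head self-attention. Module names and hyperparameters here are our assumptions.

```python
# Minimal sketch (ours, not the authors' code) of a unified block where
# everything is fixed except the spatial token mixer (STM).
import torch
import torch.nn as nn

class DepthwiseConvSTM(nn.Module):
    """Convolution-style STM: local spatial aggregation."""
    def __init__(self, dim, kernel_size=7):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size,
                            padding=kernel_size // 2, groups=dim)

    def forward(self, x):                  # x: (B, C, H, W)
        return self.dw(x)

class AttentionSTM(nn.Module):
    """Attention-style STM: global spatial aggregation."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)   # (B, HW, C) token sequence
        t, _ = self.attn(t, t, t)
        return t.transpose(1, 2).reshape(b, c, h, w)

class UnifiedBlock(nn.Module):
    """Fixed block design (norm -> STM -> residual -> MLP); only `stm` varies."""
    def __init__(self, dim, stm, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)  # channel-wise LayerNorm equivalent
        self.stm = stm
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1))

    def forward(self, x):
        x = x + self.stm(self.norm1(x))
        return x + self.mlp(self.norm2(x))

x = torch.randn(2, 64, 14, 14)
for stm in (DepthwiseConvSTM(64), AttentionSTM(64)):
    print(UnifiedBlock(64, stm)(x).shape)  # torch.Size([2, 64, 14, 14])
```

Holding the surrounding block fixed in this way is what lets any remaining accuracy gap be attributed to the STM itself rather than to network- or block-level design.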

Related papers
- The Sword of Damocles in ViTs: Computational Redundancy Amplifies Adversarial Transferability [38.32538271219404]
 We investigate the role of computational redundancy in Vision Transformers (ViTs) and its impact on adversarial transferability.
We identify two forms of redundancy, data-level and model-level, that can be harnessed to amplify attack effectiveness.
Building on this insight, we design a suite of techniques, including attention sparsity manipulation, attention head permutation, clean token regularization, ghost MoE diversification, and test-time adversarial training.
 arXiv  Detail & Related papers  (2025-04-15T01:59:47Z)
- MSSFC-Net: Enhancing Building Interpretation with Multi-Scale Spatial-Spectral Feature Collaboration [4.480146005071275]
 Building interpretation from remote sensing imagery primarily involves two fundamental tasks: building extraction and change detection.
We propose a Multi-Scale Spatial-Spectral Feature Cooperative Dual-Task Network (MSSFC-Net) for joint building extraction and change detection in remote sensing images.
 arXiv  Detail & Related papers  (2025-04-01T13:10:23Z)
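As an illustration of the joint-task pattern described above, here is a generic sketch (hypothetical module names, not the MSSFC-Net architecture): a shared encoder feeds a per-date building-extraction head and a change head on the concatenated bi-temporal features.

```python
import torch
import torch.nn as nn

class DualTaskNet(nn.Module):
    """Generic joint building extraction + change detection:
    one shared encoder, two task heads (a simplification, not MSSFC-Net)."""
    def __init__(self, c=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU())
        self.extract_head = nn.Conv2d(c, 1, 1)      # per-date building mask
        self.change_head = nn.Conv2d(2 * c, 1, 1)   # bi-temporal change mask

    def forward(self, t1, t2):
        f1, f2 = self.encoder(t1), self.encoder(t2)  # shared weights
        return (self.extract_head(f1), self.extract_head(f2),
                self.change_head(torch.cat([f1, f2], dim=1)))

net = DualTaskNet()
m1, m2, change = net(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```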
- Investigation of Hierarchical Spectral Vision Transformer Architecture for Classification of Hyperspectral Imagery [7.839253919389809]
 The theoretical justification for vision Transformers outperforming CNN architectures in HSI classification remains an open question.
A unified hierarchical spectral vision Transformer architecture, specifically tailored for HSI classification, is investigated.
It is concluded that the unique strength of vision Transformers can be attributed to their overarching architecture.
 arXiv  Detail & Related papers  (2024-09-14T00:53:13Z)
- Aligning in a Compact Space: Contrastive Knowledge Distillation between Heterogeneous Architectures [4.119589507611071]
 We propose a Low-Frequency Components-based Contrastive Knowledge Distillation (LFCC) framework that significantly enhances the performance of feature-based distillation.
Specifically, we design a set of multi-scale low-pass filters to extract the low-frequency components of intermediate features from both the teacher and student models (sketched below).
We show that LFCC achieves superior performance on the challenging benchmarks of ImageNet-1K and CIFAR-100.
 arXiv  Detail & Related papers  (2024-05-28T18:44:42Z)
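A hypothetical sketch of the low-frequency idea (ours, not the LFCC authors' code): smooth teacher and student feature maps with average-pooling low-pass filters at several scales, then align them with an InfoNCE-style contrastive loss. Scales and temperature are our assumptions.

```python
import torch
import torch.nn.functional as F

def low_freq(feat, scales=(2, 4)):
    """Multi-scale low-pass filtering via blur (downsample + upsample)."""
    outs = []
    for s in scales:
        lp = F.avg_pool2d(feat, s)  # discard high-frequency detail
        outs.append(F.interpolate(lp, size=feat.shape[-2:],
                                  mode="bilinear", align_corners=False))
    return torch.stack(outs).mean(0)

def contrastive_kd_loss(student, teacher, tau=0.1):
    """InfoNCE between low-frequency student/teacher features of a batch."""
    s = F.normalize(low_freq(student).flatten(1), dim=1)  # (B, D)
    t = F.normalize(low_freq(teacher).flatten(1), dim=1)
    logits = s @ t.T / tau            # (B, B) cross-model similarities
    labels = torch.arange(s.size(0))  # positives on the diagonal
    return F.cross_entropy(logits, labels)

loss = contrastive_kd_loss(torch.randn(8, 64, 14, 14),
                           torch.randn(8, 64, 14, 14))
```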
- Understanding Self-attention Mechanism via Dynamical System Perspective [58.024376086269015]
 The self-attention mechanism (SAM) is widely used in various fields of artificial intelligence.
We show that the intrinsic stiffness phenomenon (SP) found in high-precision solutions of ordinary differential equations (ODEs) also widely exists in high-performance neural networks (NNs).
We show that SAM is also a stiffness-aware step size adaptor that can enhance the model's representational ability to measure intrinsic SP.
 arXiv  Detail & Related papers  (2023-08-19T08:17:41Z)
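For context on the claim above: a residual update can be read as one explicit-Euler step of an ODE, and the paper's view is that self-attention makes the step size input-adaptive. In our notation (not the paper's):

$$\dot{x} = f(x) \;\Longrightarrow\; x_{n+1} = x_n + h\,f(x_n), \qquad x_{n+1} = x_n + a(x_n)\odot f(x_n),$$

where the fixed step $h$ is replaced by an attention-derived, input-dependent factor $a(x_n)$. Stiff dynamics (rapidly varying $f$) are exactly the regime where a fixed-step solver must shrink $h$, which is why an adaptive step is claimed to help.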
- Improving Stain Invariance of CNNs for Segmentation by Fusing Channel Attention and Domain-Adversarial Training [5.501810688265425]
 Variability in staining protocols, such as different slide preparation techniques, chemicals, and scanner configurations, can result in a diverse set of whole slide images (WSIs).
This distribution shift can negatively impact the performance of deep learning models on unseen samples.
We propose a method for improving the generalizability of convolutional neural networks (CNNs) to stain changes in a single-source setting for semantic segmentation.
 arXiv  Detail & Related papers  (2023-04-22T16:54:37Z)
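The domain-adversarial ingredient mentioned above is commonly implemented with a gradient reversal layer (in the style of Ganin & Lempitsky); a minimal sketch under that assumption, with toy heads standing in for the paper's segmentation decoder and stain-domain classifier:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None  # flip gradients flowing to the encoder

features = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
seg_head = nn.Linear(16, 2)     # stands in for a segmentation decoder
domain_head = nn.Linear(16, 4)  # predicts the stain/domain label

x = torch.randn(8, 3, 32, 32)
f = features(x)
task_logits = seg_head(f)
domain_logits = domain_head(GradReverse.apply(f, 1.0))  # encoder unlearns stain
```

Training the domain head through the reversed gradient pushes the encoder toward stain-invariant features while the main task loss keeps them discriminative.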
- A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
 Self-attention modules (SAMs) produce strongly correlated attention maps across different layers.
Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module.
Our simple yet effective DIA can consistently enhance various network backbones.
 arXiv  Detail & Related papers  (2022-10-27T13:24:08Z)
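A minimal sketch of the layer-sharing idea, assuming a simplified channel-attention form (the paper's exact DIA design may differ): a single module instance, whose LSTM cell carries state across depth, is applied at every layer.

```python
import torch
import torch.nn as nn

class SharedDIA(nn.Module):
    """One attention module shared by all layers; an LSTM cell carries its
    hidden state across depth (a simplification of the paper's design)."""
    def __init__(self, dim):
        super().__init__()
        self.cell = nn.LSTMCell(dim, dim)
        self.state = None

    def reset(self):
        self.state = None

    def forward(self, x):              # x: (B, C, H, W)
        pooled = x.mean(dim=(2, 3))    # (B, C) global descriptor
        h, c = self.cell(pooled, self.state)
        self.state = (h, c)
        return x * torch.sigmoid(h)[:, :, None, None]  # channel attention

dia = SharedDIA(32)
blocks = nn.ModuleList(nn.Conv2d(32, 32, 3, padding=1) for _ in range(4))
x = torch.randn(2, 32, 8, 8)
dia.reset()
for blk in blocks:  # the SAME dia instance is applied at every layer
    x = x + dia(torch.relu(blk(x)))
```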
- Weak Augmentation Guided Relational Self-Supervised Learning [80.0680103295137]
 We introduce a novel relational self-supervised learning (ReSSL) framework that learns representations by modeling the relationship between different instances.
Our proposed method employs a sharpened distribution of pairwise similarities among different instances as the relation metric (sketched below).
 Experimental results show that our proposed ReSSL substantially outperforms the state-of-the-art methods across different network architectures.
 arXiv  Detail & Related papers  (2022-03-16T16:14:19Z)
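The relation metric above can be sketched as aligning the strongly augmented view's similarity distribution (over a memory queue) with a sharper target distribution from the weakly augmented view; the temperatures and queue size here are our assumptions, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def ressl_loss(z_weak, z_strong, queue, t_teacher=0.04, t_student=0.1):
    """Cross-entropy between pairwise-similarity ("relation") distributions.
    The smaller teacher temperature sharpens the target distribution."""
    zw = F.normalize(z_weak, dim=1)
    zs = F.normalize(z_strong, dim=1)
    q = F.normalize(queue, dim=1)
    p_t = F.softmax(zw @ q.T / t_teacher, dim=1)         # sharpened target
    log_p_s = F.log_softmax(zs @ q.T / t_student, dim=1)
    return -(p_t * log_p_s).sum(1).mean()

loss = ressl_loss(torch.randn(8, 128),    # weak-view embeddings
                  torch.randn(8, 128),    # strong-view embeddings
                  torch.randn(256, 128))  # memory queue
```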
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
 Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
 arXiv  Detail & Related papers  (2022-03-15T06:52:25Z)
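A generic rendering of such a hybrid (our illustration, not the paper's exact network): a convolutional branch for local features, an attention branch for long-range dependencies, fused and then upsampled with PixelShuffle.

```python
import torch
import torch.nn as nn

class HybridSRBlock(nn.Module):
    """Local CNN branch + global attention branch, fused by a 1x1 conv."""
    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.local = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(dim, dim, 3, padding=1))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)          # (B, HW, C) tokens
        g, _ = self.attn(t, t, t)                 # long-range dependencies
        g = g.transpose(1, 2).reshape(b, c, h, w)
        return x + self.fuse(torch.cat([self.local(x), g], dim=1))

head = nn.Conv2d(3, 32, 3, padding=1)
up = nn.Sequential(nn.Conv2d(32, 3 * 4, 3, padding=1), nn.PixelShuffle(2))
sr = up(HybridSRBlock()(head(torch.randn(1, 3, 24, 24))))  # 2x upscaled output
```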
- Vision Transformers are Robust Learners [65.91359312429147]
 We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
 arXiv  Detail & Related papers  (2021-05-17T02:39:22Z)
- Exploring Complementary Strengths of Invariant and Equivariant Representations for Few-Shot Learning [96.75889543560497]
 In many real-world problems, collecting a large number of labeled samples is infeasible.
Few-shot learning is the dominant approach to address this issue, where the objective is to quickly adapt to novel categories in presence of a limited number of samples.
We propose a novel training mechanism that simultaneously enforces equivariance and invariance to a general set of geometric transformations.
 arXiv  Detail & Related papers  (2021-03-01T21:14:33Z)
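A toy rendering of enforcing both properties on a single transformation family (90-degree rotations; the paper uses a more general set of geometric transformations): an invariance term pulls embeddings of transformed views together, while an auxiliary head must recover which transform was applied (equivariance).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 64))
rot_head = nn.Linear(64, 4)  # predicts rotation in {0, 90, 180, 270} degrees

x = torch.randn(8, 3, 32, 32)
k = torch.randint(0, 4, (8,))  # random rotation index per image
xr = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                  for img, r in zip(x, k)])

z, zr = encoder(x), encoder(xr)
inv_loss = 1 - F.cosine_similarity(z, zr).mean()  # invariance term
equi_loss = F.cross_entropy(rot_head(zr), k)      # equivariance term
loss = inv_loss + equi_loss
```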
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.