Revisiting Transformers with Insights from Image Filtering
- URL: http://arxiv.org/abs/2506.10371v1
- Date: Thu, 12 Jun 2025 05:46:57 GMT
- Title: Revisiting Transformers with Insights from Image Filtering
- Authors: Laziz U. Abdullaev, Maksim Tkachenko, Tan M. Nguyen
- Abstract summary: Self-attention is a cornerstone of Transformer-based state-of-the-art deep learning architectures. We develop a unifying image processing framework to explain self-attention and its components. We empirically observe that image processing-inspired modifications can lead to notably improved accuracy and robustness against data contamination and adversaries across language and vision tasks, as well as better long-sequence understanding.
- Score: 3.042104695845305
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The self-attention mechanism, a cornerstone of Transformer-based state-of-the-art deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation that explains its remarkable success and limitations has therefore become an increasingly prominent focus of recent research. Some notable directions have explored understanding self-attention through the lens of image denoising and nonparametric regression. While promising, existing frameworks still lack a deeper mechanistic interpretation of the various architectural components that enhance self-attention, both in its original formulation and in subsequent variants. In this work, we aim to advance this understanding by developing a unifying image processing framework capable of explaining not only the self-attention computation itself but also the role of components such as positional encoding and residual connections, including numerous later variants. Building on our framework, we also pinpoint potential distinctions between self-attention and image filtering, and make an effort to close this gap. To this end, we introduce two independent architectural modifications within transformers. While our primary objective is interpretability, we empirically observe that these image processing-inspired modifications can also lead to notably improved accuracy and robustness against data contamination and adversaries across language and vision tasks, as well as better long-sequence understanding.
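As a concrete illustration of the denoising perspective the abstract refers to, the sketch below shows the well-known algebraic correspondence between softmax self-attention and a kernel smoother such as the non-local means image filter: both replace each token (or patch) with a similarity-weighted average of all the others. This is a minimal NumPy sketch of that correspondence, not the paper's actual framework or its two proposed modifications; the function names, the bandwidth `h`, and the equal-norm setup are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_means(patches, h=1.0):
    """Non-local means: each patch becomes a weighted average of all
    patches, weighted by a Gaussian kernel on pairwise patch distance."""
    d2 = ((patches[:, None, :] - patches[None, :, :]) ** 2).sum(-1)  # (N, N) squared distances
    weights = softmax(-d2 / (2.0 * h**2), axis=-1)  # each row sums to 1
    return weights @ patches

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

# When queries and keys share one projection and all tokens have equal norm,
# -||x_i - x_j||^2 / 2 equals x_i . x_j up to constants that the softmax
# absorbs, so the two weight matrices coincide exactly.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
x /= np.linalg.norm(x, axis=-1, keepdims=True)  # normalize to equal-norm tokens

d = x.shape[-1]
w_qk = d**0.25 * np.eye(d)  # alpha^2 / sqrt(d) = 1 cancels the 1/sqrt(d) scale
print(np.allclose(self_attention(x, w_qk, w_qk, np.eye(d)),
                  nonlocal_means(x, h=1.0)))  # True under these assumptions
```

Under such a reading, residual connections loosely resemble the identity-plus-filter form of classical sharpening filters, and positional encodings act like the spatial coordinates that a bilateral-style filter combines with appearance similarity; the paper develops its own precise version of these correspondences.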
Related papers
- "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z)
- Reconstruction-guided attention improves the robustness and shape processing of neural networks [5.156484100374057]
We build an iterative encoder-decoder network that generates an object reconstruction and uses it as top-down attentional feedback.
Our model shows strong generalization performance against various image perturbations.
Our study shows that modeling reconstruction-based feedback endows AI systems with a powerful attention mechanism.
arXiv Detail & Related papers (2022-09-27T18:32:22Z)
- On the interplay of adversarial robustness and architecture components: patches, convolution and attention [65.20660287833537]
We study the effect of adversarial training on the interpretability of the learnt features and robustness to unseen threat models.
An ablation from ResNet to ConvNeXt reveals key architectural changes leading to almost $10\%$ higher $\ell_\infty$-robustness.
arXiv Detail & Related papers (2022-09-14T22:02:32Z)
- Robustness and invariance properties of image classifiers [8.970032486260695]
Deep neural networks have achieved impressive results in many image classification tasks.
Deep networks are not robust to a large variety of semantic-preserving image modifications.
The poor robustness of image classifiers to small data distribution shifts raises serious concerns regarding their trustworthiness.
arXiv Detail & Related papers (2022-08-30T11:00:59Z)
- Understanding The Robustness in Vision Transformers [140.1090560977082]
Self-attention may promote robustness through improved mid-level representations.
We propose a family of fully attentional networks (FANs) that strengthen this capability.
Our model achieves state-of-the-art 87.1% accuracy on ImageNet-1k and 35.8% mCE on ImageNet-C with 76.8M parameters.
arXiv Detail & Related papers (2022-04-26T17:16:32Z)
- Perception Over Time: Temporal Dynamics for Robust Image Understanding [5.584060970507506]
Deep learning surpasses human-level performance in narrow and specific vision tasks.
Human visual perception is orders of magnitude more robust to changes in the input stimulus.
We introduce a novel method of incorporating temporal dynamics into static image understanding.
arXiv Detail & Related papers (2022-03-11T21:11:59Z)
- IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers [81.31885548824926]
The self-attention-based model, the transformer, has recently become the leading backbone in the field of computer vision.
We present an Interpretability-Aware REDundancy REDuction framework (IA-RED$^2$).
We include extensive experiments on both image and video tasks, where our method could deliver up to 1.4X speed-up.
arXiv Detail & Related papers (2021-06-23T18:29:23Z)
- Transformers with Competitive Ensembles of Independent Mechanisms [97.93090139318294]
We propose TIM, a new Transformer layer that divides the hidden representation and parameters into multiple mechanisms that exchange information only through attention.
We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement, and find evidence for semantically meaningful specialization as well as improved performance.
arXiv Detail & Related papers (2021-02-27T21:48:46Z)
- Image-to-image Mapping with Many Domains by Sparse Attribute Transfer [71.28847881318013]
Unsupervised image-to-image translation consists of learning a pair of mappings between two domains without known pairwise correspondences between points.
The current convention is to approach this task with cycle-consistent GANs.
We propose an alternate approach that directly restricts the generator to performing a simple sparse transformation in a latent layer.
arXiv Detail & Related papers (2020-06-23T19:52:23Z)