Artifacts and Attention Sinks: Structured Approximations for Efficient Vision Transformers
- URL: http://arxiv.org/abs/2507.16018v1
- Date: Mon, 21 Jul 2025 19:29:03 GMT
- Title: Artifacts and Attention Sinks: Structured Approximations for Efficient Vision Transformers
- Authors: Andrew Lu, Wentinn Liao, Liuhui Wang, Huzheng Yang, Jianbo Shi
- Abstract summary: Vision transformers have emerged as a powerful tool across a wide range of applications, yet their inner workings remain only partially understood. We examine the phenomenon of massive tokens - tokens with exceptionally high activation norms that act as attention sinks - and artifact tokens that emerge as a byproduct during inference. We introduce Fast Nyström Attention (FNA), a training-free method that approximates self-attention in linear time and space.
- Score: 8.486148475471271
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers have emerged as a powerful tool across a wide range of applications, yet their inner workings remain only partially understood. In this work, we examine the phenomenon of massive tokens - tokens with exceptionally high activation norms that act as attention sinks - and artifact tokens that emerge as a byproduct during inference. Our analysis reveals that these tokens mutually suppress one another through the attention mechanism, playing a critical role in regulating information flow within the network. Leveraging these insights, we introduce Fast Nyström Attention (FNA), a training-free method that approximates self-attention in linear time and space by exploiting the structured patterns formed by massive and artifact tokens. Additionally, we propose a masking strategy to mitigate noise from these tokens, yielding modest performance gains at virtually no cost. We evaluate our approach on popular pretrained vision backbones and demonstrate competitive performance on retrieval, classification, segmentation, and visual question answering (VQA), all while reducing computational overhead.
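The abstract describes FNA only at a high level. As a rough illustration, below is a minimal NumPy sketch of a generic Nyström-style landmark attention approximation; using the massive/artifact token positions as landmarks is an assumption, and the function names and landmark selection are illustrative only - this is not the authors' exact FNA algorithm.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nystrom_attention(Q, K, V, landmark_idx):
    """Nystrom-style approximation of softmax(Q K^T / sqrt(d)) @ V.

    Q, K, V: (n, d) arrays; landmark_idx: indices of m << n landmark tokens
    (here assumed to be the high-norm massive/artifact token positions).
    Cost is O(n * m) rather than O(n^2).
    """
    d = Q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    Q_l, K_l = Q[landmark_idx], K[landmark_idx]   # (m, d)

    kernel_1 = softmax(Q @ K_l.T * scale)         # (n, m)
    kernel_2 = softmax(Q_l @ K_l.T * scale)       # (m, m)
    kernel_3 = softmax(Q_l @ K.T * scale)         # (m, n)

    # The pseudo-inverse of the small landmark-landmark block stitches the
    # two thin factors into an approximation of the full attention map.
    return kernel_1 @ np.linalg.pinv(kernel_2) @ (kernel_3 @ V)

# Quick check against exact attention on random data.
rng = np.random.default_rng(0)
n, d, m = 256, 64, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
landmarks = rng.choice(n, size=m, replace=False)  # stand-in for massive/artifact token indices
approx = nystrom_attention(Q, K, V, landmarks)
exact = softmax(Q @ K.T / np.sqrt(d)) @ V
print("mean abs error:", np.abs(approx - exact).mean())
```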
Related papers
- Attention (as Discrete-Time Markov) Chains [70.46604474584181]
We introduce a new interpretation of the attention matrix as a discrete-time Markov chain. Our main observation is that tokens corresponding to semantically similar regions form a set of metastable states. Using these lightweight tools, we demonstrate state-of-the-art zero-shot segmentation.
arXiv Detail & Related papers (2025-07-23T16:20:47Z)
- Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration [8.584066042703972]
We propose a many-to-many Token Transforming framework that serves as a generalization of all existing methods. Specifically, we reduce FLOPs by 40% and accelerate DeiT-S by 1.5× with a marginal 0.1% accuracy drop. We extend the method to dense prediction tasks including segmentation, object detection, depth estimation, and language model generation.
arXiv Detail & Related papers (2025-06-06T03:18:11Z)
- ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models [59.47738955960352]
ToDRE is a two-stage, training-free token compression framework. It achieves superior performance by pruning tokens based on token Diversity and token-task RElevance.
arXiv Detail & Related papers (2025-05-24T15:47:49Z)
- STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference [3.9464481148889354]
We propose STAR (Stage-wise Attention-guided token Reduction), a training-free, plug-and-play framework that approaches token pruning from a global perspective. Instead of pruning at a single point, STAR performs attention-guided reduction in two complementary stages: an early-stage pruning based on visual self-attention to remove redundant low-level features, and a later-stage pruning guided by cross-modal attention to discard task-irrelevant tokens.
arXiv Detail & Related papers (2025-05-18T10:44:45Z)
- TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model [56.43860351559185]
We introduce TopV, a compatible TOken Pruning with inference Time Optimization for fast and low-memory VLMs. Our framework incorporates a visual-aware cost function to measure the importance of each source visual token, enabling effective pruning of low-importance tokens.
arXiv Detail & Related papers (2025-03-24T01:47:26Z)
- "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z)
- Attention Sinks and Outlier Features: A 'Catch, Tag, and Release' Mechanism for Embeddings [4.30907936718325]
Two prominent features of large language models (LLMs) are the presence of large-norm (outlier) features and the tendency for tokens to attend very strongly to a select few tokens. We show that attention sinks utilize outlier features to: catch a sequence of tokens, tag the captured tokens by applying a common perturbation, and then release the tokens back into the residual stream.
arXiv Detail & Related papers (2025-02-02T21:15:07Z)
- Dynamic Token Reduction during Generation for Vision Language Models [11.376359442815986]
We introduce a dynamic pruning strategy tailored for Vision-Language Models (VLMs). Our approach enables flexible adjustment of pruning rates based on the attention distribution. Our experimental results demonstrate that our method not only reduces computational demands but also maintains the quality of responses.
arXiv Detail & Related papers (2025-01-24T03:20:37Z)
- [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs [66.5266435598799]
Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision tasks. However, their efficient deployment remains a substantial challenge due to high computational costs and memory requirements. We introduce a simple yet effective method for training-free visual compression, called VTC-compression.
arXiv Detail & Related papers (2024-12-08T05:29:39Z)
- Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use [74.72150542395487]
An inherent waveform pattern in the attention allocation of large language models (LLMs) significantly affects their performance in tasks demanding a high degree of context awareness.
To address this issue, we propose a novel inference method named Attention Buckets.
arXiv Detail & Related papers (2023-12-07T17:24:51Z)
- Unlocking Pixels for Reinforcement Learning via Implicit Attention [61.666538764049854]
We make use of new efficient attention algorithms, recently shown to be highly effective for Transformers.
This allows our attention-based controllers to scale to larger visual inputs, and facilitate the use of smaller patches.
In addition, we propose a new efficient algorithm approximating softmax attention with what we call hybrid random features.
arXiv Detail & Related papers (2021-02-08T17:00:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.