Static Key Attention in Vision
- URL: http://arxiv.org/abs/2412.07049v1
- Date: Mon, 09 Dec 2024 23:18:09 GMT
- Title: Static Key Attention in Vision
- Authors: Zizhao Hu, Xiaolin Zhou, Mohammad Rostami
- Abstract summary: We study the impact of substituting the dynamically parameterized key with a static key within the standard attention mechanism in Vision Transformers. Our findings reveal that static key attention mechanisms can match or even exceed the performance of standard self-attention.
- Score: 19.014373531742297
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The success of vision transformers is widely attributed to the expressive power of their dynamically parameterized multi-head self-attention mechanism. We examine the impact of substituting the dynamically parameterized key with a static key within the standard attention mechanism in Vision Transformers. Our findings reveal that static key attention mechanisms can match or even exceed the performance of standard self-attention. When static key attention modules are integrated into a MetaFormer backbone, they serve as a better intermediate stage in hierarchical hybrid architectures, balancing the strengths of depth-wise convolution and self-attention. Experiments on several vision tasks underscore the effectiveness of the static key mechanism, indicating that the typical two-step dynamic parameterization in attention can be streamlined to a single step without impacting performance under certain circumstances.
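As a rough illustration of the idea (a sketch under assumptions, not the authors' implementation): in standard attention the keys are projected from the input, whereas in a static-key variant the keys are learned parameters that do not depend on the input. The per-position static keys, dimensions, and single-head setup below are illustrative assumptions.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def static_key_attention(queries, static_keys, values):
    """Single-head attention where the keys are learned parameters
    (static with respect to the input) rather than projections of the
    input tokens.

    queries:     list of d-dim vectors, one per input token
    static_keys: list of d-dim learned vectors, fixed across inputs
    values:      list of vectors aligned with static_keys
    """
    d = len(static_keys[0])
    out = []
    for q in queries:
        # scaled dot-product scores against the static keys
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in static_keys]
        w = softmax(scores)
        # each output row is a convex combination of the value vectors
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out
```

Because the keys are parameters rather than input projections, one of the two projection steps of standard attention disappears, which matches the abstract's point that the two-step dynamic parameterization can be streamlined to a single step.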
Related papers
- CAViT -- Channel-Aware Vision Transformer for Dynamic Feature Fusion [0.3683202928838613]
Vision Transformers (ViTs) have demonstrated strong performance across a range of computer vision tasks by modeling long-range interactions via self-attention. We introduce CAViT, a dual-attention architecture that replaces the static parameter with a dynamic, attention-based mechanism for feature interaction. We validate CAViT across five benchmark datasets spanning both natural and medical domains, where it outperforms the standard ViT baseline by up to +3.6% in accuracy, while reducing FLOPs by over 30%.
arXiv Detail & Related papers (2026-02-05T12:33:09Z) - Indirect Attention: Turning Context Misalignment into a Feature [2.3425919199730694]
This work explores a less conventional scenario, in which keys and values originate from different sequences or modalities. We first analyze the attention mechanism's behavior under noisy value features, establishing a critical noise threshold. We then model context (key, value) misalignment as an effective form of structured noise within the value features, demonstrating that the noise induced by such misalignment can substantially exceed this critical threshold. Motivated by this, we introduce Indirect Attention, a modified attention mechanism that infers relevance indirectly in scenarios with misaligned context.
arXiv Detail & Related papers (2025-09-30T09:44:00Z) - Dynamic Relational Priming Improves Transformer in Multivariate Time Series [0.0]
We propose attention with dynamic relational priming (prime attention). We show that prime attention consistently outperforms standard attention across benchmarks. We also find that prime attention achieves comparable or superior performance using up to 40% less sequence length compared to standard attention.
arXiv Detail & Related papers (2025-09-15T17:56:15Z) - Dynamic Attention Mechanism in Spatiotemporal Memory Networks for Object Tracking [8.040709469401257]
We propose a differentiable dynamic attention mechanism that adaptively adjusts channel attention weights by analyzing spatial attention weights.
A lightweight gating network autonomously allocates computational resources based on target motion states, prioritizing high-discriminability features in challenging scenarios.
arXiv Detail & Related papers (2025-03-21T00:48:31Z) - Scaled and Inter-token Relation Enhanced Transformer for Sample-restricted Residential NILM [0.0]
We propose a novel transformer architecture with two key innovations: inter-token relation enhancement and dynamic temperature tuning. We validate our method on the REDD dataset and show that it outperforms the original transformer and state-of-the-art models by 10-15% in F1 score across various appliance types.
arXiv Detail & Related papers (2024-10-12T18:58:45Z) - DualAD: Disentangling the Dynamic and Static World for End-to-End Driving [11.379456277711379]
State-of-the-art approaches for autonomous driving integrate multiple sub-tasks of the overall driving task into a single pipeline.
We propose dedicated representations to disentangle dynamic agents and static scene elements.
Our method titled DualAD outperforms independently trained single-task networks.
arXiv Detail & Related papers (2024-06-10T13:46:07Z) - Towards Robust Semantic Segmentation against Patch-based Attack via Attention Refinement [68.31147013783387]
We observe that the attention mechanism is vulnerable to patch-based adversarial attacks.
In this paper, we propose a Robust Attention Mechanism (RAM) to improve the robustness of the semantic segmentation model.
arXiv Detail & Related papers (2024-01-03T13:58:35Z) - On the Optimization and Generalization of Multi-head Attention [28.33164313549433]
We investigate the potential optimization and generalization advantages of using multiple attention heads.
We derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model.
arXiv Detail & Related papers (2023-10-19T12:18:24Z) - Accelerating Vision Transformers Based on Heterogeneous Attention Patterns [89.86293867174324]
Vision Transformers (ViTs) have attracted a lot of attention in the field of computer vision.
We propose an integrated compression pipeline based on observed heterogeneous attention patterns across layers.
Experimentally, the integrated compression pipeline of DGSSA and GLAD improves run-time throughput by up to 121%.
arXiv Detail & Related papers (2023-10-11T17:09:19Z) - DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present the Deformable Attention Transformer (DAT++), an efficient and effective vision backbone for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z) - Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
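For context, $\sigma$Reparam reparameterizes each weight matrix by its spectral norm, $\hat{W} = (\gamma / \sigma(W))\, W$ with a learnable scalar $\gamma$, which keeps the scale of the attention logits in check. A minimal pure-Python sketch of this reparameterization, with the spectral norm estimated by power iteration (matrix sizes and iteration count are illustrative, and this is not the authors' code):

```python
import math

def spectral_norm(W, iters=100):
    """Estimate the largest singular value of W by power iteration."""
    rows, cols = len(W), len(W[0])
    v = [1.0] * cols
    for _ in range(iters):
        # alternate u = W v and v = W^T u, renormalizing v each round
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))

def sigma_reparam(W, gamma=1.0):
    """Return (gamma / sigma(W)) * W, so the result has spectral norm gamma."""
    s = spectral_norm(W)
    return [[gamma * w / s for w in row] for row in W]
```

With `gamma = 1.0`, the reparameterized matrix has unit spectral norm; during training, `gamma` would be a learnable parameter so the network can still scale its weights, but the spectral norm is explicitly controlled.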
arXiv Detail & Related papers (2023-03-11T03:30:47Z) - Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition [32.45255303465946]
We introduce sparse attention and monotonic attention into Transformer-based ASR.
The experiments show that our method can effectively improve the attention mechanism on widely used benchmarks of speech recognition.
arXiv Detail & Related papers (2022-09-30T01:55:57Z) - Understanding The Robustness in Vision Transformers [140.1090560977082]
Self-attention may promote robustness through improved mid-level representations.
We propose a family of fully attentional networks (FANs) that strengthen this capability.
Our model achieves state-of-the-art 87.1% accuracy and 35.8% mCE on ImageNet-1k and ImageNet-C with 76.8M parameters.
arXiv Detail & Related papers (2022-04-26T17:16:32Z) - Visualizing and Understanding Patch Interactions in Vision Transformer [96.70401478061076]
Vision Transformer (ViT) has become a leading tool in various computer vision tasks.
We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches for vision transformer.
arXiv Detail & Related papers (2022-03-11T13:48:11Z) - Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention that explicitly encourages self-attention to match the distributions of the key and query within each head.
It is simple to convert any models with self-attention, including pre-trained ones, to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
arXiv Detail & Related papers (2021-10-25T00:54:57Z) - Armour: Generalizable Compact Self-Attention for Vision Transformers [0.0]
This paper introduces a compact self-attention mechanism that is fundamental and highly generalizable.
We show its drop-in applicability to both the regular attention mechanism and several of the most recent variants in vision transformers.
arXiv Detail & Related papers (2021-08-03T22:33:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.