Interpreting and Improving Attention From the Perspective of Large Kernel Convolution
- URL: http://arxiv.org/abs/2401.05738v3
- Date: Mon, 02 Dec 2024 00:04:23 GMT
- Title: Interpreting and Improving Attention From the Perspective of Large Kernel Convolution
- Authors: Chenghao Li, Chaoning Zhang, Boheng Zeng, Yi Lu, Pengbo Shi, Qingzi Chen, Jirui Liu, Lingyun Zhu, Yang Yang, Heng Tao Shen
- Abstract summary: We introduce Large Kernel Convolutional Attention (LKCA), a novel formulation that reinterprets attention operations as a single large-kernel convolution. LKCA achieves competitive performance across various visual tasks, particularly in data-constrained settings.
- Score: 51.06461246235176
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention mechanisms have significantly advanced visual models by capturing global context effectively. However, their reliance on large-scale datasets and substantial computational resources poses challenges in data-scarce and resource-constrained scenarios. Moreover, traditional self-attention mechanisms lack inherent spatial inductive biases, making them suboptimal for modeling local features critical to tasks involving smaller datasets. In this work, we introduce Large Kernel Convolutional Attention (LKCA), a novel formulation that reinterprets attention operations as a single large-kernel convolution. This design unifies the strengths of convolutional architectures (locality and translation invariance) with the global context modeling capabilities of self-attention. By embedding these properties into a computationally efficient framework, LKCA addresses key limitations of traditional attention mechanisms. The proposed LKCA achieves competitive performance across various visual tasks, particularly in data-constrained settings. Experimental results on CIFAR-10, CIFAR-100, SVHN, and Tiny-ImageNet demonstrate its ability to excel in image classification, outperforming conventional attention mechanisms and vision transformers in compact model settings. These findings highlight the effectiveness of LKCA in bridging local and global feature modeling, offering a practical and robust solution for real-world applications with limited data and resources.
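To make the reformulation concrete, below is a minimal PyTorch sketch of the core idea: the attention map is replaced by a single large-kernel depthwise convolution applied to a projected value map. The layer layout, kernel size, and names here are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class LKCA(nn.Module):
    """Sketch of Large Kernel Convolutional Attention: global token mixing
    is performed by one large depthwise convolution, inheriting locality
    and translation invariance while keeping a wide receptive field."""
    def __init__(self, dim: int, kernel_size: int = 13):
        super().__init__()
        self.v = nn.Conv2d(dim, dim, 1)                    # value projection
        self.large_kernel = nn.Conv2d(                     # single large depthwise kernel
            dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.proj = nn.Conv2d(dim, dim, 1)                 # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        return self.proj(self.large_kernel(self.v(x)))

x = torch.randn(2, 64, 32, 32)                             # CIFAR-scale feature map
print(LKCA(dim=64)(x).shape)                               # torch.Size([2, 64, 32, 32])
```

Because the depthwise kernel is shared across positions, the cost grows linearly with the number of tokens rather than quadratically as in softmax attention.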
Related papers
- Quantifying Memory Utilization with Effective State-Size [73.52115209375343]
We develop a measure of memory utilization.
This metric is tailored to the fundamental class of systems with input-invariant and input-varying linear operators.
arXiv Detail & Related papers (2025-04-28T08:12:30Z) - LSNet: See Large, Focus Small [67.05569159984691]
We introduce LS (Large-Small) convolution, which combines large-kernel perception and small-kernel aggregation.
LSNet achieves superior performance and efficiency over existing lightweight networks in various vision tasks.
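As a rough illustration of the "see large, focus small" pattern, the sketch below pairs a large depthwise kernel for broad perception with a small kernel for local aggregation; the kernel sizes and composition are assumptions for illustration, not LSNet's actual design.

```python
import torch.nn as nn

class LSConvSketch(nn.Module):
    """Illustrative large-small convolution: perceive a wide context with
    a large depthwise kernel, then aggregate locally with a small kernel."""
    def __init__(self, dim: int, large_k: int = 11, small_k: int = 3):
        super().__init__()
        self.perceive = nn.Conv2d(dim, dim, large_k, padding=large_k // 2, groups=dim)
        self.aggregate = nn.Conv2d(dim, dim, small_k, padding=small_k // 2)

    def forward(self, x):
        return self.aggregate(self.perceive(x))
```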
arXiv Detail & Related papers (2025-03-29T16:00:54Z) - C2D-ISR: Optimizing Attention-based Image Super-resolution from Continuous to Discrete Scales [6.700548615812325]
We propose a novel framework, C2D-ISR, for optimizing attention-based image super-resolution models.
Our approach is based on a two-stage training methodology and a hierarchical encoding mechanism.
In addition, we generalize the hierarchical encoding mechanism with existing attention-based network structures.
arXiv Detail & Related papers (2025-03-17T21:52:18Z) - Context Matters: Query-aware Dynamic Long Sequence Modeling of Gigapixel Images [4.3565203412433195]
Whole slide image (WSI) analysis presents significant computational challenges due to the massive number of patches in gigapixel images.
We propose Querent, i.e., the query-aware long contextual dynamic modeling framework.
Our approach dramatically reduces computational overhead while preserving global perception to model fine-grained patch correlations.
arXiv Detail & Related papers (2025-01-31T09:29:21Z) - Context-CrackNet: A Context-Aware Framework for Precise Segmentation of Tiny Cracks in Pavement images [3.9599054392856483]
This study proposes Context-CrackNet, a novel encoder-decoder architecture featuring the Region-Focused Enhancement Module (RFEM) and Context-Aware Global Module (CAGM).
The model consistently outperformed 9 state-of-the-art segmentation frameworks, achieving superior performance metrics such as mIoU and Dice score.
The model's balance of precision and computational efficiency highlights its potential for real-time deployment in large-scale pavement monitoring systems.
arXiv Detail & Related papers (2025-01-24T11:28:17Z) - Core Context Aware Attention for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling.
Our CCA-Attention significantly outperforms state-of-the-art models in terms of computational efficiency and long-context modeling ability.
arXiv Detail & Related papers (2024-12-17T01:54:08Z) - Point Cloud Understanding via Attention-Driven Contrastive Learning [64.65145700121442]
Transformer-based models have advanced point cloud understanding by leveraging self-attention mechanisms.
PointACL is an attention-driven contrastive learning framework designed to address these limitations.
Our method employs an attention-driven dynamic masking strategy that guides the model to focus on under-attended regions.
arXiv Detail & Related papers (2024-11-22T05:41:00Z) - Anchor Attention, Small Cache: Code Generation with Large Language Models [15.94784908771546]
Current practices in NLP often use sparse attention, which can unfortunately lead to substantial inaccuracies, or hallucinations, in code generation tasks.
We propose a novel approach, AnchorCoder, which features token-wise anchor attention designed to extract and compress contextual information.
It can consistently achieve a significant (at least 70%) reduction in KV cache requirements while preserving the majority of the model's performance.
arXiv Detail & Related papers (2024-11-11T02:47:05Z) - Adaptive Masking Enhances Visual Grounding [12.793586888511978]
We propose IMAGE, Interpretative MAsking with Gaussian radiation modEling, to enhance vocabulary grounding in low-shot learning scenarios.
We evaluate the efficacy of our approach on benchmark datasets, including COCO and ODinW, demonstrating its superior performance in zero-shot and few-shot tasks.
arXiv Detail & Related papers (2024-10-04T05:48:02Z) - CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [59.193626019860226]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks through the powerful global context modeling capability of their token mixers.
We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers.
We show that CAS-ViT achieves competitive performance compared to other state-of-the-art backbones.
arXiv Detail & Related papers (2024-08-07T11:33:46Z) - On the Power of Convolution Augmented Transformer [30.46405043231576]
We study the benefits of Convolution-Augmented Transformer (CAT) for recall, copying, and length generalization tasks.
CAT incorporates convolutional filters in the K/Q/V embeddings of an attention layer.
We show that the locality of the convolution synergizes with the global view of the attention.
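A minimal sketch of the convolution-augmented idea: short depthwise 1D filters are applied to the Q/K/V sequences before standard attention, so each token carries a local neighborhood view into the global attention step. The filter length and placement are assumptions, not the paper's exact construction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAugmentedAttention(nn.Module):
    """Sketch of convolution-augmented attention: depthwise 1D convolutions
    smooth the Q/K/V sequences before standard scaled dot-product attention."""
    def __init__(self, dim: int, kernel: int = 3):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.conv = nn.Conv1d(3 * dim, 3 * dim, kernel,
                              padding=kernel // 2, groups=3 * dim)
        self.scale = dim ** -0.5

    def forward(self, x):                                   # x: (B, T, C)
        qkv = self.conv(self.qkv(x).transpose(1, 2)).transpose(1, 2)
        q, k, v = qkv.chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v
```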
arXiv Detail & Related papers (2024-07-08T04:08:35Z) - Low-Resolution Self-Attention for Semantic Segmentation [93.30597515880079]
We introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost.
Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution.
We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure.
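The sketch below illustrates the fixed-low-resolution trick: queries stay at full resolution while keys and values are pooled to a constant grid, so the attention cost no longer scales with input resolution. The 16x16 pooled size and projection layout are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowResSelfAttention(nn.Module):
    """Sketch of low-resolution self-attention: keys and values live on a
    fixed pooled grid, keeping the attention matrix a constant size."""
    def __init__(self, dim: int, pool: int = 16):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)
        self.kv = nn.Conv2d(dim, 2 * dim, 1)
        self.pool = nn.AdaptiveAvgPool2d(pool)
        self.scale = dim ** -0.5

    def forward(self, x):                                 # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)          # (B, HW, C)
        k, v = self.kv(self.pool(x)).flatten(2).chunk(2, dim=1)
        attn = F.softmax(q @ k * self.scale, dim=-1)      # (B, HW, P*P)
        out = attn @ v.transpose(1, 2)                    # (B, HW, C)
        return out.transpose(1, 2).reshape(B, C, H, W)
```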
arXiv Detail & Related papers (2023-10-08T06:10:09Z) - Rethinking Local Perception in Lightweight Vision Transformer [63.65115590184169]
This paper introduces CloFormer, a lightweight vision transformer that leverages context-aware local enhancement.
CloFormer explores the relationship between globally shared weights often used in vanilla convolutional operators and token-specific context-aware weights appearing in attention.
The proposed AttnConv uses shared weights to aggregate local information and deploys carefully designed context-aware weights to enhance local features.
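A rough sketch of how the two weight types might be combined: a depthwise convolution (globally shared weights) aggregates local information, while token-specific weights computed from the input rescale the result. The gating form here is an assumption, not CloFormer's exact AttnConv.

```python
import torch.nn as nn

class AttnConvSketch(nn.Module):
    """Illustrative mix of shared-weight aggregation and context-aware
    gating: the conv branch has shared weights, the gate is token-specific."""
    def __init__(self, dim: int, k: int = 3):
        super().__init__()
        self.shared = nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)
        self.context = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.Tanh())

    def forward(self, x):
        return self.context(x) * self.shared(x)   # context-aware gate * shared-weight conv
```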
arXiv Detail & Related papers (2023-03-31T05:25:32Z) - A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With fewer than 14M parameters, our FCViT-S12 outperforms the related ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z) - Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition [158.15602882426379]
This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features.
By comparing the design principles of recent convolutional neural networks (ConvNets) and Vision Transformers, we propose to simplify self-attention by leveraging a convolutional modulation operation.
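A minimal sketch of a convolutional modulation block in this spirit: a large depthwise convolution produces the mixing weights, which modulate a linear value branch through a Hadamard product instead of a softmax attention matrix. Kernel size and layer names are illustrative.

```python
import torch.nn as nn

class ConvMod(nn.Module):
    """Sketch of convolutional modulation: a large depthwise conv yields
    an 'attention' map that elementwise-modulates a value projection."""
    def __init__(self, dim: int, k: int = 11):
        super().__init__()
        self.a = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.GELU(),
                               nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim))
        self.v = nn.Conv2d(dim, dim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        return self.proj(self.a(x) * self.v(x))
```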
arXiv Detail & Related papers (2022-11-22T01:39:45Z) - ParCNetV2: Oversized Kernel with Enhanced Attention [60.141606180434195]
We introduce a convolutional neural network architecture named ParCNetV2.
It extends position-aware circular convolution (ParCNet) with oversized convolutions and strengthens attention through bifurcate gate units.
Our method outperforms other pure convolutional neural networks as well as neural networks hybridizing CNNs and transformers.
arXiv Detail & Related papers (2022-11-14T07:22:55Z) - HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions [109.33112814212129]
We show that input-adaptive, long-range and high-order spatial interactions can be efficiently implemented with a convolution-based framework.
We present the Recursive Gated Convolution ($g^n$Conv) that performs high-order spatial interactions with gated convolutions.
Based on the operation, we construct a new family of generic vision backbones named HorNet.
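The sketch below gives a simplified second-order version of the recursive gating idea: depthwise-convolved features gate the signal repeatedly, raising the order of spatial interactions without attention. HorNet's $g^n$Conv splits channels across orders and rescales them; this collapses those details for clarity.

```python
import torch
import torch.nn as nn

class GnConvSketch(nn.Module):
    """Simplified second-order recursive gated convolution: each output
    element mixes higher-order products of depthwise-convolved inputs."""
    def __init__(self, dim: int, k: int = 7):
        super().__init__()
        self.proj_in = nn.Conv2d(dim, 3 * dim, 1)
        self.dwconv = nn.Conv2d(2 * dim, 2 * dim, k, padding=k // 2, groups=2 * dim)
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        p, q1, q2 = self.proj_in(x).chunk(3, dim=1)
        q1, q2 = self.dwconv(torch.cat([q1, q2], dim=1)).chunk(2, dim=1)
        p = p * q1                      # first-order gated interaction
        p = p * q2                      # second-order: gate recursively
        return self.proj_out(p)
```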
arXiv Detail & Related papers (2022-07-28T17:59:02Z) - KVT: k-NN Attention for Boosting Vision Transformers [44.189475770152185]
We propose a sparse attention scheme, dubbed k-NN attention, for boosting vision transformers.
The proposed k-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations.
We verify, both theoretically and empirically, that $k$-NN attention is powerful in distilling noise from input tokens and in speeding up training.
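As a sketch of the mechanism, each query keeps only its top-$k$ most similar keys and masks the rest before the softmax; the `topk` setting below is an illustrative default.

```python
import torch
import torch.nn.functional as F

def knn_attention(q, k, v, topk: int = 8):
    """Sketch of k-NN attention: logits below each query's k-th largest
    score are masked out before the softmax. q, k, v: (B, T, C)."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, T, T)
    kth = scores.topk(topk, dim=-1).values[..., -1:]        # k-th largest per query
    masked = scores.masked_fill(scores < kth, float('-inf'))
    return F.softmax(masked, dim=-1) @ v

q = k = v = torch.randn(1, 16, 32)
print(knn_attention(q, k, v).shape)   # torch.Size([1, 16, 32])
```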
arXiv Detail & Related papers (2021-05-28T06:49:10Z) - Adaptive Attention Span in Computer Vision [5.371337604556311]
Recent developments in Transformers for language modeling have opened new areas of research in computer vision.
Results from late 2019 showed vast performance increases in both object detection and recognition when convolutions are replaced by local self-attention kernels.
Models using local self-attention kernels were also shown to have fewer parameters and FLOPs than equivalent architectures that use only convolutions.
arXiv Detail & Related papers (2020-04-18T21:32:47Z)