Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and
Context Terms
- URL: http://arxiv.org/abs/2107.05637v1
- Date: Mon, 12 Jul 2021 18:00:00 GMT
- Title: Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and
Context Terms
- Authors: Chenglin Yang, Siyuan Qiao, Adam Kortylewski, Alan Yuille
- Abstract summary: Self-Attention has become prevalent in computer vision models.
We propose Locally Enhanced Self-Attention (LESA), which enhances the unary term with convolutions.
The results on ImageNet and COCO show the superiority of LESA over convolution and self-attention baselines for the tasks of image recognition, object detection, and instance segmentation.
- Score: 18.857745441710076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-Attention has become prevalent in computer vision models. Inspired by
fully connected Conditional Random Fields (CRFs), we decompose it into local
and context terms. They correspond to the unary and binary terms in a CRF and
are implemented by attention mechanisms with projection matrices. We observe
that the unary terms make only small contributions to the outputs, while
standard CNNs that rely solely on unary terms achieve strong performance on a
variety of tasks. We therefore propose Locally Enhanced Self-Attention (LESA),
which enhances the unary term with convolutions and uses a fusion module to
dynamically couple the unary and binary operations. In our experiments, we
replace the self-attention modules with LESA. The results on ImageNet and COCO
show the superiority of LESA over convolution and self-attention baselines for
image recognition, object detection, and instance segmentation. The code is
publicly available.
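The decomposition described in the abstract maps naturally onto a small module: a convolution supplies the local (unary) term, standard self-attention supplies the context (binary) term, and a learned gate couples the two. The PyTorch sketch below illustrates this under stated assumptions; the module name LESA2d, the depthwise convolution, the sigmoid gate, and all layer sizes are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LESA2d(nn.Module):
    """Minimal sketch of Locally Enhanced Self-Attention (assumptions:
    depthwise conv as the local/unary term, multi-head self-attention
    over the flattened feature map as the context/binary term, and a
    per-channel sigmoid gate as the fusion module)."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local (unary) term: enhanced with a convolution.
        self.local = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=1, groups=channels)
        # Context (binary) term: self-attention with projection matrices.
        self.attn = nn.MultiheadAttention(channels, num_heads,
                                          batch_first=True)
        # Fusion module: dynamically couples the two terms per channel.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local(x)                   # unary term
        seq = x.flatten(2).transpose(1, 2)      # (B, H*W, C)
        ctx, _ = self.attn(seq, seq, seq)       # binary term
        ctx = ctx.transpose(1, 2).reshape(b, c, h, w)
        g = self.gate(x)                        # dynamic coupling weight
        return g * local + (1.0 - g) * ctx

# Usage: drop-in replacement for a self-attention block on a feature map.
x = torch.randn(2, 64, 14, 14)
print(LESA2d(64)(x).shape)  # torch.Size([2, 64, 14, 14])
```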
Related papers
- Grounding Everything: Emerging Localization Properties in
Vision-Language Transformers [51.260510447308306]
We show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning.
We propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path.
We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation.
arXiv Detail & Related papers (2023-12-01T19:06:12Z)
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
- CoC-GAN: Employing Context Cluster for Unveiling a New Pathway in Image Generation [12.211795836214112]
We propose an image generation process premised on converting images into a set of point clouds.
Our methodology leverages a simple clustering method, Context Clustering (CoC), to generate images from unordered point sets.
We introduce this model with its novel structure as the Context Clustering Generative Adversarial Network (CoC-GAN).
arXiv Detail & Related papers (2023-08-23T01:19:58Z)
- Self-Attention Based Generative Adversarial Networks For Unsupervised Video Summarization [78.2700757742992]
We build on a popular method where a Generative Adversarial Network (GAN) is trained to create representative summaries.
We propose the SUM-GAN-AED model that uses a self-attention mechanism for frame selection, combined with LSTMs for encoding and decoding.
arXiv Detail & Related papers (2023-07-16T19:56:13Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- Self-Attention for Audio Super-Resolution [0.0]
We propose a network architecture for audio super-resolution that combines convolution and self-attention.
Attention-based Feature-Wise Linear Modulation (AFiLM) uses a self-attention mechanism instead of recurrent neural networks to modulate the activations of the convolutional model.
arXiv Detail & Related papers (2021-08-26T08:05:07Z)
- Boosting Few-shot Semantic Segmentation with Transformers [81.43459055197435]
We propose a TRansformer-based Few-shot Semantic segmentation method (TRFS).
Our model consists of two modules: a Global Enhancement Module (GEM) and a Local Enhancement Module (LEM).
arXiv Detail & Related papers (2021-08-04T20:09:21Z)
- Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries [51.48859591280838]
We present EgoACO, a deep neural architecture for video action recognition.
It learns to pool action-context-object descriptors from frame-level features.
Cap uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions.
arXiv Detail & Related papers (2021-02-16T10:26:04Z)