Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and
Context Terms
- URL: http://arxiv.org/abs/2107.05637v1
- Date: Mon, 12 Jul 2021 18:00:00 GMT
- Title: Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and
Context Terms
- Authors: Chenglin Yang, Siyuan Qiao, Adam Kortylewski, Alan Yuille
- Abstract summary: Self-Attention has become prevalent in computer vision models.
We propose Locally Enhanced Self-Attention (LESA), which enhances the unary term with convolutions.
The results on ImageNet and COCO show the superiority of LESA over convolution and self-attention baselines for the tasks of image recognition, object detection, and instance segmentation.
- Score: 18.857745441710076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-Attention has become prevalent in computer vision models. Inspired by
fully connected Conditional Random Fields (CRFs), we decompose it into local
and context terms. They correspond to the unary and binary terms in a CRF and
are implemented by attention mechanisms with projection matrices. We observe
that the unary terms make only small contributions to the outputs, while
standard CNNs that rely solely on unary terms achieve strong performance on a
variety of tasks. We therefore propose Locally Enhanced Self-Attention (LESA),
which enhances the unary term with convolutions and uses a fusion module to
dynamically couple the unary and binary operations. In our experiments, we
replace the self-attention modules with LESA. The results on ImageNet and COCO
show the superiority of LESA over convolution and self-attention baselines for
image recognition, object detection, and instance segmentation. The code is
publicly available.
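The decomposition described in the abstract maps naturally onto a small module: a convolution supplies the local (unary) term, standard self-attention supplies the context (binary) term, and a learned gate couples the two. The PyTorch sketch below illustrates this under stated assumptions; the module name LESA2d, the depthwise convolution, the sigmoid gate, and all layer sizes are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LESA2d(nn.Module):
    """Minimal sketch of Locally Enhanced Self-Attention (assumptions:
    depthwise conv as the local/unary term, multi-head self-attention
    over the flattened feature map as the context/binary term, and a
    per-channel sigmoid gate as the fusion module)."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local (unary) term: enhanced with a convolution.
        self.local = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=1, groups=channels)
        # Context (binary) term: self-attention with projection matrices.
        self.attn = nn.MultiheadAttention(channels, num_heads,
                                          batch_first=True)
        # Fusion module: dynamically couples the two terms per channel.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local(x)                   # unary term
        seq = x.flatten(2).transpose(1, 2)      # (B, H*W, C)
        ctx, _ = self.attn(seq, seq, seq)       # binary term
        ctx = ctx.transpose(1, 2).reshape(b, c, h, w)
        g = self.gate(x)                        # dynamic coupling weight
        return g * local + (1.0 - g) * ctx

# Usage: drop-in replacement for a self-attention block on a feature map.
x = torch.randn(2, 64, 14, 14)
print(LESA2d(64)(x).shape)  # torch.Size([2, 64, 14, 14])
```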
Related papers
- Grounding Everything: Emerging Localization Properties in
Vision-Language Transformers [51.260510447308306]
We show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning.
We propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path.
We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation.
arXiv Detail & Related papers (2023-12-01T19:06:12Z)
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
- CoC-GAN: Employing Context Cluster for Unveiling a New Pathway in Image Generation [12.211795836214112]
We propose an image generation process premised on converting images into a set of point clouds.
Our methodology leverages a simple clustering method, Context Clustering (CoC), to generate images from unordered point sets.
We introduce this model with its novel structure as the Context Clustering Generative Adversarial Network (CoC-GAN).
arXiv Detail & Related papers (2023-08-23T01:19:58Z)
- Self-Attention Based Generative Adversarial Networks For Unsupervised Video Summarization [78.2700757742992]
We build on a popular method where a Generative Adversarial Network (GAN) is trained to create representative summaries.
We propose the SUM-GAN-AED model that uses a self-attention mechanism for frame selection, combined with LSTMs for encoding and decoding.
arXiv Detail & Related papers (2023-07-16T19:56:13Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- Self-Attention for Audio Super-Resolution [0.0]
We propose a network architecture for audio super-resolution that combines convolution and self-attention.
Attention-based Feature-Wise Linear Modulation (AFiLM) uses a self-attention mechanism instead of recurrent neural networks to modulate the activations of the convolutional model.
arXiv Detail & Related papers (2021-08-26T08:05:07Z)
- Boosting Few-shot Semantic Segmentation with Transformers [81.43459055197435]
We propose a TRansformer-based Few-shot Semantic segmentation method (TRFS).
Our model consists of two modules: a Global Enhancement Module (GEM) and a Local Enhancement Module (LEM).
arXiv Detail & Related papers (2021-08-04T20:09:21Z)
- Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries [51.48859591280838]
We present EgoACO, a deep neural architecture for video action recognition.
It learns to pool action-context-object descriptors from frame-level features.
Cap uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions.
arXiv Detail & Related papers (2021-02-16T10:26:04Z)