Sparse and Structured Visual Attention
- URL: http://arxiv.org/abs/2002.05556v2
- Date: Thu, 8 Jul 2021 12:39:43 GMT
- Title: Sparse and Structured Visual Attention
- Authors: Pedro Henrique Martins, Vlad Niculae, Zita Marinho, André Martins
- Abstract summary: We replace the traditional softmax attention mechanism with two alternative sparsity-promoting transformations.
Experiments show gains in accuracy as well as higher similarity to human attention, which suggests better interpretability.
- Score: 15.227884641004673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual attention mechanisms are widely used in multimodal tasks, such as
visual question answering (VQA). One drawback of softmax-based attention mechanisms is
that they assign some probability mass to all image regions, regardless of
their adjacency structure and of their relevance to the text. In this paper, to
better link the image structure with the text, we replace the traditional
softmax attention mechanism with two alternative sparsity-promoting
transformations: sparsemax, which is able to select only the relevant regions
(assigning zero weight to the rest), and a newly proposed Total-Variation
Sparse Attention (TVmax), which further encourages the joint selection of
adjacent spatial locations. Experiments in VQA show gains in accuracy as well
as higher similarity to human attention, which suggests better
interpretability.
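As a rough illustration of the first transformation: sparsemax is the Euclidean projection of the score vector onto the probability simplex, $\mathrm{sparsemax}(z) = \arg\max_{p \in \Delta} p^\top z - \frac{1}{2}\|p\|^2$, which can return exact zeros; TVmax, per the abstract, additionally penalizes differences between adjacent regions so that the selected regions form contiguous patches. Below is a minimal NumPy sketch of sparsemax only, not the authors' code (TVmax needs a total-variation proximal step and is omitted here).
```python
import numpy as np

def sparsemax(z: np.ndarray) -> np.ndarray:
    """Euclidean projection of scores z onto the probability simplex.

    Unlike softmax, the output can contain exact zeros
    (Martins & Astudillo, 2016).
    """
    z_sorted = np.sort(z)[::-1]                  # scores in decreasing order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = k[1.0 + k * z_sorted > cumsum]     # candidate support sizes
    k_z = support[-1]                            # size of the support set
    tau = (cumsum[k_z - 1] - 1.0) / k_z          # threshold
    return np.maximum(z - tau, 0.0)

# Toy scores for four image regions.
scores = np.array([2.0, 1.0, 0.1, -1.0])
print(sparsemax(scores))                         # [1. 0. 0. 0.]  -- sparse
print(np.exp(scores) / np.exp(scores).sum())     # softmax: all entries > 0
```
Running the example shows sparsemax concentrating all mass on the top-scoring region, while softmax assigns every region nonzero weight, which is exactly the drawback the abstract describes.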
Related papers
- Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S$^2$RM to achieve high-quality cross-modality fusion.
It follows a three-step working strategy: distributing language features, spatial semantic recurrent coparsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z) - TOPIQ: A Top-down Approach from Semantics to Distortions for Image
Quality Assessment [53.72721476803585]
Image Quality Assessment (IQA) is a fundamental task in computer vision that has witnessed remarkable progress with deep neural networks.
We propose a top-down approach that uses high-level semantics to guide the IQA network to focus on semantically important local distortion regions.
A key component of our approach is the proposed cross-scale attention mechanism, which calculates attention maps for lower-level features.
arXiv Detail & Related papers (2023-08-06T09:08:37Z) - Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth
Estimation in Dynamic Scenes [51.20150148066458]
We propose a novel method to learn to fuse the multi-view and monocular cues encoded as volumes without needing heuristically crafted masks.
Experiments on real-world datasets prove the significant effectiveness and ability of the proposed method.
arXiv Detail & Related papers (2023-04-18T13:55:24Z) - AF$_2$: Adaptive Focus Framework for Aerial Imagery Segmentation [86.44683367028914]
Aerial imagery segmentation poses some unique challenges, the most critical of which is foreground-background imbalance.
We propose the Adaptive Focus Framework (AF$_2$), which adopts a hierarchical segmentation procedure and focuses on adaptively utilizing multi-scale representations.
AF$_2$ significantly improves accuracy on three widely used aerial benchmarks while remaining as fast as mainstream methods.
arXiv Detail & Related papers (2022-02-18T10:14:45Z) - An attention-driven hierarchical multi-scale representation for visual
recognition [3.3302293148249125]
Convolutional Neural Networks (CNNs) have revolutionized the understanding of visual content.
We propose a method to capture high-level long-range dependencies by exploring Graph Convolutional Networks (GCNs).
Our approach is simple yet extremely effective in solving both the fine-grained and generic visual classification problems.
arXiv Detail & Related papers (2021-10-23T09:22:22Z) - Beyond Self-attention: External Attention using Two Linear Layers for
Visual Tasks [34.32609892928909]
We propose a novel attention mechanism which we call external attention, based on two external, small, learnable, and shared memories.
Our method provides comparable or superior performance to the self-attention mechanism and some of its variants, with much lower computational and memory costs; a minimal sketch appears after this list.
arXiv Detail & Related papers (2021-05-05T22:29:52Z) - Multimodal Continuous Visual Attention Mechanisms [3.222802562733787]
We introduce a new continuous attention mechanism that produces multimodal densities in the form of mixtures of Gaussians.
Our densities decompose as a linear combination of unimodal attention mechanisms, enabling closed-form Jacobians for the backpropagation step; see the sketch after this list.
arXiv Detail & Related papers (2021-04-07T10:47:51Z) - Adaptive Bi-directional Attention: Exploring Multi-Granularity
Representations for Machine Reading Comprehension [29.717816161964105]
We propose a novel approach called Adaptive Bidirectional Attention, which adaptively feeds source representations of different levels to the predictor.
Results are better than the previous state-of-the-art model by 2.5% EM and 2.3% F1 scores.
arXiv Detail & Related papers (2020-12-20T09:31:35Z) - Robust Person Re-Identification through Contextual Mutual Boosting [77.1976737965566]
We propose the Contextual Mutual Boosting Network (CMBN), which localizes pedestrians and recalibrates features by effectively exploiting contextual information and statistical inference.
Experiments on the benchmarks demonstrate the superiority of the architecture compared to the state-of-the-art.
arXiv Detail & Related papers (2020-09-16T06:33:35Z) - Spatially Aware Multimodal Transformers for TextVQA [61.01618988620582]
We study the TextVQA task, i.e., reasoning about text in images to answer a question.
Existing approaches are limited in their use of spatial relations.
We propose a novel spatially aware self-attention layer.
arXiv Detail & Related papers (2020-07-23T17:20:55Z)
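For the external-attention entry above ("Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks"), here is a minimal PyTorch sketch, not the authors' implementation: keys and values come from two small learnable memories shared across all inputs, so the cost grows linearly with the number of tokens. The double-normalization step reflects my reading of the paper, and the memory size of 64 is an illustrative choice.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExternalAttention(nn.Module):
    """Sketch of external attention: keys/values come from two small,
    learnable memories shared across all inputs, not from the input itself."""

    def __init__(self, d_model: int, mem_size: int = 64):
        super().__init__()
        self.mem_k = nn.Linear(d_model, mem_size, bias=False)  # memory M_k
        self.mem_v = nn.Linear(mem_size, d_model, bias=False)  # memory M_v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, d_model)
        attn = self.mem_k(x)                      # (batch, n_tokens, mem_size)
        attn = F.softmax(attn, dim=1)             # normalize over tokens
        attn = attn / (attn.sum(-1, keepdim=True) + 1e-9)  # l1 over memory slots
        return self.mem_v(attn)                   # (batch, n_tokens, d_model)

x = torch.randn(2, 196, 256)                      # e.g. a 14x14 feature map
print(ExternalAttention(256)(x).shape)            # torch.Size([2, 196, 256])
```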
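And for "Multimodal Continuous Visual Attention Mechanisms", a NumPy sketch of the core idea under stated assumptions: attention is a continuous density over image coordinates, here a two-component Gaussian mixture with placeholder parameters. In the paper those parameters are predicted from the input and the context integral has closed form; the grid sum below is only a crude approximation.
```python
import numpy as np

def gauss_pdf(points: np.ndarray, mu: np.ndarray, cov: np.ndarray) -> np.ndarray:
    """Evaluate a 2D Gaussian density at an (n, 2) array of points."""
    diff = points - mu
    inv = np.linalg.inv(cov)
    expo = -0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff)
    return np.exp(expo) / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))

# Regular grid over the unit square, standing in for image coordinates.
xs = np.linspace(0.0, 1.0, 32)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)

# Placeholder mixture parameters; a model would predict these from the input.
weights = np.array([0.6, 0.4])
means = np.array([[0.3, 0.3], [0.7, 0.6]])
covs = np.array([0.01 * np.eye(2), 0.02 * np.eye(2)])

# Multimodal attention density: a convex combination of unimodal Gaussians.
density = sum(w * gauss_pdf(grid, m, c) for w, m, c in zip(weights, means, covs))

# Context vector: density-weighted average of per-location value features.
values = np.random.rand(grid.shape[0], 8)         # stand-in feature map
context = (density[:, None] * values).sum(axis=0) / density.sum()
print(context.shape)                              # (8,)
```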
This list is automatically generated from the titles and abstracts of the papers in this site.