More Than Just Attention: Learning Cross-Modal Attentions with
Contrastive Constraints
- URL: http://arxiv.org/abs/2105.09597v1
- Date: Thu, 20 May 2021 08:48:10 GMT
- Title: More Than Just Attention: Learning Cross-Modal Attentions with
Contrastive Constraints
- Authors: Yuxiao Chen, Jianbo Yuan, Long Zhao, Rui Luo, Larry Davis, Dimitris N.
Metaxas
- Abstract summary: We propose Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints to address the lack of direct attention supervision.
CCR and CCS constraints supervise the training of attention models in a contrastive learning manner without requiring explicit attention annotations.
Experiments on both Flickr30k and MS-COCO datasets demonstrate that integrating these attention constraints into two state-of-the-art attention-based models improves the model performance.
- Score: 63.08768589044052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention mechanisms have been widely applied to cross-modal tasks such as
image captioning and information retrieval, and have achieved remarkable
improvements owing to their capability to learn fine-grained relevance across
different modalities. However, existing attention models can be sub-optimal
and imprecise because no direct supervision is involved during
training. In this work, we propose Contrastive Content Re-sourcing (CCR) and
Contrastive Content Swapping (CCS) constraints to address this limitation.
These constraints supervise the training of attention models in a contrastive
learning manner without requiring explicit attention annotations. Additionally,
we introduce three metrics, namely Attention Precision, Recall and F1-Score, to
quantitatively evaluate the attention quality. We evaluate the proposed
constraints with cross-modal retrieval (image-text matching) task. The
experiments on both Flickr30k and MS-COCO datasets demonstrate that integrating
these attention constraints into two state-of-the-art attention-based models
improves the model performance in terms of both retrieval accuracy and
attention metrics.
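The abstract names Attention Precision, Recall, and F1-Score but does not spell out their computation. A minimal sketch of how such metrics could be computed is shown below, assuming attention weights are thresholded and compared against binary per-region relevance annotations; the function name, threshold, and formulation are illustrative assumptions, not the paper's exact definitions.

```python
def attention_prf1(attention, ground_truth, threshold=0.5):
    """Hypothetical attention precision/recall/F1 against region annotations.

    attention: list of floats in [0, 1], one weight per candidate region
    ground_truth: list of 0/1 flags marking the truly relevant regions
    """
    # Binarize the attention map: a region counts as "attended"
    # if its weight reaches the threshold.
    predicted = [1 if a >= threshold else 0 for a in attention]

    # Standard confusion-matrix counts over regions.
    tp = sum(1 for p, g in zip(predicted, ground_truth) if p and g)
    fp = sum(1 for p, g in zip(predicted, ground_truth) if p and not g)
    fn = sum(1 for p, g in zip(predicted, ground_truth) if not p and g)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```

For example, with attention weights `[0.9, 0.6, 0.2, 0.1]` and ground truth `[1, 0, 1, 0]`, the first region is a true positive, the second a false positive, and the third a false negative, giving precision, recall, and F1 of 0.5 each.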
Related papers
- Attention Overlap Is Responsible for The Entity Missing Problem in Text-to-image Diffusion Models! [3.355491272942994]
This study examines three potential causes of the entity-missing problem, focusing on cross-attention dynamics.
We found that reducing overlap in attention maps between entities can effectively minimize the rate of entity missing.
arXiv Detail & Related papers (2024-10-28T12:43:48Z) - 2D Feature Distillation for Weakly- and Semi-Supervised 3D Semantic
Segmentation [92.17700318483745]
We propose an image-guidance network (IGNet) which builds upon the idea of distilling high-level feature information from a domain-adapted, synthetically trained 2D semantic segmentation network.
IGNet achieves state-of-the-art results for weakly-supervised LiDAR semantic segmentation on ScribbleKITTI, boasting up to 98% relative performance to fully supervised training with only 8% labeled points.
arXiv Detail & Related papers (2023-11-27T07:57:29Z) - Generic Attention-model Explainability by Weighted Relevance
Accumulation [9.816810016935541]
We propose a weighted relevancy strategy, which takes the importance of token values into consideration, to reduce distortion when equally accumulating relevance.
To evaluate our method, we propose a unified CLIP-based two-stage model, named CLIPmapper, to process Vision-and-Language tasks.
arXiv Detail & Related papers (2023-08-20T12:02:30Z) - SANCL: Multimodal Review Helpfulness Prediction with Selective Attention
and Natural Contrastive Learning [41.92038829041499]
Multimodal Review Helpfulness Prediction (MRHP) aims to sort product reviews according to the predicted helpfulness scores.
Previous work on this task focuses on attention-based modality fusion, information integration, and relation modeling.
We propose SANCL: Selective Attention and Natural Contrastive Learning for MRHP.
arXiv Detail & Related papers (2022-09-12T06:31:13Z) - Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z) - How Knowledge Graph and Attention Help? A Quantitative Analysis into
Bag-level Relation Extraction [66.09605613944201]
We quantitatively evaluate the effect of attention and Knowledge Graph on bag-level relation extraction (RE).
We find that (1) higher attention accuracy may lead to worse performance as it may harm the model's ability to extract entity mention features; (2) the performance of attention is largely influenced by various noise distribution patterns; and (3) KG-enhanced attention indeed improves RE performance, while not through enhanced attention but by incorporating entity prior.
arXiv Detail & Related papers (2021-07-26T09:38:28Z) - Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z) - Semi-supervised Left Atrium Segmentation with Mutual Consistency
Training [60.59108570938163]
We propose a novel Mutual Consistency Network (MC-Net) for semi-supervised left atrium segmentation from 3D MR images.
Our MC-Net consists of one encoder and two slightly different decoders, and the prediction discrepancies of two decoders are transformed as an unsupervised loss.
We evaluate our MC-Net on the public Left Atrium (LA) database and it obtains impressive performance gains by exploiting the unlabeled data effectively.
arXiv Detail & Related papers (2021-03-04T09:34:32Z) - Attention Meets Perturbations: Robust and Interpretable Attention with
Adversarial Training [7.106986689736828]
We propose a general training technique for natural language processing tasks, including AT for attention (Attention AT) and more interpretable AT for attention (Attention iAT).
The proposed techniques improved the prediction performance and the model interpretability by exploiting the mechanisms with AT.
arXiv Detail & Related papers (2020-09-25T07:26:45Z) - Cross-Correlated Attention Networks for Person Re-Identification [34.84287025161801]
We propose a new attention module called Cross-Correlated Attention (CCA).
CCA aims to overcome such limitations by maximizing the information gain between different attended regions.
We also propose a novel deep network that makes use of different attention mechanisms to learn robust and discriminative representations of person images.
arXiv Detail & Related papers (2020-06-17T01:47:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.