Linguistically-aware Attention for Reducing the Semantic-Gap in
Vision-Language Tasks
- URL: http://arxiv.org/abs/2008.08012v1
- Date: Tue, 18 Aug 2020 16:29:49 GMT
- Title: Linguistically-aware Attention for Reducing the Semantic-Gap in
Vision-Language Tasks
- Authors: Gouthaman KV, Athira Nambiar, Kancheti Sai Srinivas, Anurag Mittal
- Abstract summary: We propose an attention mechanism - Linguistically-aware Attention (LAT) - that leverages object attributes obtained from generic object detectors.
LAT represents visual and textual modalities in a common linguistically-rich space, thus providing linguistic awareness to the attention process.
We apply and demonstrate the effectiveness of LAT in three Vision-language (V-L) tasks: Counting-VQA, VQA, and Image captioning.
- Score: 9.462808515258464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention models are widely used in Vision-language (V-L) tasks to
correlate visual and textual information. Humans perform such correlation with
a strong linguistic understanding of the visual world. However, even the
best-performing attention models in V-L tasks lack such a high-level linguistic
understanding,
thus creating a semantic gap between the modalities. In this paper, we propose
an attention mechanism - Linguistically-aware Attention (LAT) - that leverages
object attributes obtained from generic object detectors along with pre-trained
language models to reduce this semantic gap. LAT represents visual and textual
modalities in a common linguistically-rich space, thus providing linguistic
awareness to the attention process. We apply and demonstrate the effectiveness
of LAT in three V-L tasks: Counting-VQA, VQA, and Image captioning. In
Counting-VQA, we propose a novel counting-specific VQA model to predict an
intuitive count and achieve state-of-the-art results on five datasets. In VQA
and Captioning, we show the generality and effectiveness of LAT by integrating
it into various baselines and consistently improving their performance.
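
The core step of LAT can be sketched briefly: class and attribute labels predicted by a generic object detector and the question words are both mapped into one shared word-embedding space, and attention over image regions is then computed from similarity in that space. The snippet below is a minimal, hypothetical illustration of that idea under assumed dimensions and module names (LinguisticallyAwareAttention, a single shared embedding table); it is a sketch of the scheme, not the authors' implementation.

```python
# Minimal sketch of a linguistically-aware attention step (assumed design,
# not the authors' code). Detector-predicted object labels/attributes and
# question words are mapped into one shared word-embedding space; attention
# weights over image regions are computed from similarity in that space.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinguisticallyAwareAttention(nn.Module):
    def __init__(self, vocab_size: int, word_dim: int = 300, hidden: int = 512):
        super().__init__()
        # One shared embedding table: region labels (from a generic object
        # detector) and question tokens index into the same table, so both
        # modalities live in the same linguistically-rich space.
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.region_proj = nn.Linear(word_dim, hidden)
        self.query_proj = nn.Linear(word_dim, hidden)

    def forward(self, region_label_ids, question_ids):
        # region_label_ids: (B, R) word ids of detected class/attribute labels
        # question_ids:     (B, T) word ids of the question tokens
        regions = self.region_proj(self.word_emb(region_label_ids))   # (B, R, H)
        query = self.query_proj(self.word_emb(question_ids)).mean(1)  # (B, H)
        scores = torch.bmm(regions, query.unsqueeze(2)).squeeze(2)    # (B, R)
        attn = F.softmax(scores, dim=1)             # attention over regions
        attended = torch.bmm(attn.unsqueeze(1), regions).squeeze(1)   # (B, H)
        return attn, attended


# Toy usage: 2 images, 5 detected regions each, 8-token questions.
lat = LinguisticallyAwareAttention(vocab_size=1000)
attn, ctx = lat(torch.randint(0, 1000, (2, 5)), torch.randint(0, 1000, (2, 8)))
print(attn.shape, ctx.shape)  # torch.Size([2, 5]) torch.Size([2, 512])
```

In the full models, such linguistically-grounded attention would be combined with conventional visual features; the sketch only shows the shared-space attention step.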
Related papers
- VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks [48.67062958311173]
VL-GLUE is a multitask benchmark for natural language understanding.
We show that this benchmark is quite challenging for existing large-scale vision-language models.
arXiv Detail & Related papers (2024-10-17T15:27:17Z)
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens carry high-level semantics comparable to a word and support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
- Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining? [34.609984453754656]
We aim to elucidate the impact of comprehensive linguistic knowledge, including semantic expression and syntactic structure, on multimodal alignment.
Specifically, we design and release SNARE, the first large-scale multimodal alignment probing benchmark.
arXiv Detail & Related papers (2023-08-24T16:17:40Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separate attention spaces for vision and language (a minimal sketch of this idea appears after this list).
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- Understanding Attention for Vision-and-Language Tasks [4.752823994295959]
We conduct a comprehensive analysis on understanding the role of attention alignment by looking into the attention score calculation methods.
We also analyse the conditions under which the attention score calculation mechanism would be more (or less) interpretable.
Our analysis is the first of its kind and provides useful insights into the importance of each attention alignment score calculation method when applied during the training phase of VL tasks.
arXiv Detail & Related papers (2022-08-17T06:45:07Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Cross-Modality Relevance for Reasoning on Language and Vision [22.41781462637622]
This work deals with the challenge of learning and reasoning over language and vision data for related downstream tasks such as visual question answering (VQA) and natural language for visual reasoning (NLVR).
We design a novel cross-modality relevance module that is used in an end-to-end framework to learn the relevance representation between components of various input modalities under the supervision of a target task.
Our proposed approach shows competitive performance on two different language and vision tasks using public benchmarks and improves the state-of-the-art published results.
arXiv Detail & Related papers (2020-05-12T20:17:25Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method that makes full use of an external language model (ELM) to integrate abundant linguistic knowledge into the captioning model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
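
The DiMBERT entry above mentions separate attention spaces for vision and language. The sketch below is one hedged reading of that idea, with hypothetical module and parameter names rather than the paper's implementation: each text token attends over other text tokens and over projected visual region features through two independent attention modules, and the two contexts are then merged.

```python
# Hypothetical sketch of disentangled multimodal attention in the spirit of
# the DiMBERT summary above (illustrative names and dimensions only):
# language-language and language-vision interactions use two independent
# attention modules, and their outputs are merged afterwards.
import torch
import torch.nn as nn


class DisentangledMultimodalAttention(nn.Module):
    def __init__(self, text_dim: int = 768, vis_dim: int = 2048, heads: int = 8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(text_dim, heads, batch_first=True)
        self.vis_proj = nn.Linear(vis_dim, text_dim)  # bring vision into text width
        self.cross_attn = nn.MultiheadAttention(text_dim, heads, batch_first=True)
        self.merge = nn.Linear(2 * text_dim, text_dim)

    def forward(self, text_feats, vis_feats):
        # text_feats: (B, T, text_dim)  token representations
        # vis_feats:  (B, R, vis_dim)   region features from a detector/CNN
        t_ctx, _ = self.text_attn(text_feats, text_feats, text_feats)  # text space
        v = self.vis_proj(vis_feats)
        v_ctx, _ = self.cross_attn(text_feats, v, v)                   # vision space
        return self.merge(torch.cat([t_ctx, v_ctx], dim=-1))           # (B, T, text_dim)


# Toy usage: batch of 2, 8 text tokens, 36 visual regions.
dma = DisentangledMultimodalAttention()
out = dma(torch.randn(2, 8, 768), torch.randn(2, 36, 2048))
print(out.shape)  # torch.Size([2, 8, 768])
```

Keeping the two attention spaces separate means the language-language and language-vision interactions are parameterized independently before being combined.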
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information provided and is not responsible for any consequences of its use.