Cross-modal Retrieval for Knowledge-based Visual Question Answering
- URL: http://arxiv.org/abs/2401.05736v1
- Date: Thu, 11 Jan 2024 08:40:16 GMT
- Title: Cross-modal Retrieval for Knowledge-based Visual Question Answering
- Authors: Paul Lerner, Olivier Ferret (LIST (CEA), DIASI), Camille Guinaudeau
- Abstract summary: Cross-modal retrieval may help bridge the semantic gap between an entity and its depictions.
We provide empirical evidence through experiments with a multimodal dual encoder, namely CLIP, on the recent ViQuAE, InfoSeek, and Encyclopedic-VQA datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge-based Visual Question Answering about Named Entities is a
challenging task that requires retrieving information from a multimodal
Knowledge Base. Named entities have diverse visual representations and are
therefore difficult to recognize. We argue that cross-modal retrieval may help
bridge the semantic gap between an entity and its depictions, and is foremost
complementary with mono-modal retrieval. We provide empirical evidence through
experiments with a multimodal dual encoder, namely CLIP, on the recent ViQuAE,
InfoSeek, and Encyclopedic-VQA datasets. Additionally, we study three different
strategies to fine-tune such a model: mono-modal, cross-modal, or joint
training. Our method, which combines mono- and cross-modal retrieval, is
competitive with billion-parameter models on the three datasets, while being
conceptually simpler and computationally cheaper.
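
A minimal sketch of the kind of combined retrieval the abstract describes, assuming a Hugging Face CLIP checkpoint and a simple weighted-sum fusion: knowledge-base entity images are scored against both the question image (mono-modal retrieval) and the question text (cross-modal retrieval). The checkpoint name, the fusion weight `alpha`, and the helper functions are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch only: scoring KB entity images against a visual question
# with a CLIP dual encoder, then fusing mono-modal (image-image) and
# cross-modal (text-image) similarities with a weighted sum.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(images):
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def embed_text(texts):
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def retrieve(question, question_image, kb_images, alpha=0.5):
    """Rank KB entity images by a weighted sum of mono- and cross-modal scores."""
    kb_feats = embed_images(kb_images)          # (N, d) entity image embeddings
    img_query = embed_images([question_image])  # (1, d) mono-modal query
    txt_query = embed_text([question])          # (1, d) cross-modal query
    mono = (img_query @ kb_feats.T).squeeze(0)   # image-to-image similarity
    cross = (txt_query @ kb_feats.T).squeeze(0)  # text-to-image similarity
    scores = alpha * mono + (1 - alpha) * cross  # hypothetical fusion weight
    return scores.argsort(descending=True)

# Usage: ranking = retrieve("Who designed this building?", Image.open("q.jpg"),
#                           [Image.open(p) for p in kb_image_paths])
```

In practice the fusion weight and the choice of which mono-modal channel to use (text-text or image-image) would be tuned per dataset; the sketch keeps a single weighted sum for clarity.
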
Related papers
- Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z)
- Multiple Heads are Better than One: Mixture of Modality Knowledge Experts for Entity Representation Learning [51.80447197290866]
Learning high-quality multi-modal entity representations is an important goal of multi-modal knowledge graph (MMKG) representation learning.
Existing methods focus on crafting elegant entity-wise multi-modal fusion strategies.
We introduce a novel framework with Mixture of Modality Knowledge experts (MoMoK) to learn adaptive multi-modal entity representations.
arXiv Detail & Related papers (2024-05-27T06:36:17Z)
- Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z)
- X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning [109.9413329636322]
This paper introduces an efficient framework that integrates multiple modalities (images, 3D, audio and video) into a frozen Large Language Model (LLM).
Our approach explores two distinct projection mechanisms: Q-Formers and Linear Projections (LPs).
arXiv Detail & Related papers (2023-11-30T18:43:51Z)
- MESED: A Multi-modal Entity Set Expansion Dataset with Fine-grained Semantic Classes and Hard Negative Entities [25.059177235004952]
We propose Multi-modal Entity Set Expansion (MESE), where models integrate information from multiple modalities to represent entities.
A powerful multi-modal model, MultiExpan, is proposed, which is pre-trained on four multimodal pre-training tasks.
The MESED dataset is the first multi-modal dataset for ESE with large-scale and elaborate manual calibration.
arXiv Detail & Related papers (2023-07-27T14:09:59Z)
- Read, Look or Listen? What's Needed for Solving a Multimodal Dataset [7.0430001782867]
We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it.
We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality.
We analyze the MERLOT Reserve model, finding that it struggles with image-based questions compared to text and audio, as well as with auditory speaker identification.
arXiv Detail & Related papers (2023-07-06T08:02:45Z)
- Multi-modal Contrastive Representation Learning for Entity Alignment [57.92705405276161]
Multi-modal entity alignment aims to identify equivalent entities between two different multi-modal knowledge graphs.
We propose MCLEA, a Multi-modal Contrastive Learning based Entity Alignment model.
In particular, MCLEA firstly learns multiple individual representations from multiple modalities, and then performs contrastive learning to jointly model intra-modal and inter-modal interactions.
arXiv Detail & Related papers (2022-09-02T08:59:57Z)
- Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval [12.30468719055037]
A Contrastive Cross-Modal Knowledge Sharing Pre-training method (COOKIE) is developed to learn joint text-image representations.
The first module is a weight-sharing transformer built on top of the visual and textual encoders.
The other consists of three specially designed contrastive learning objectives that aim to share knowledge between different models.
arXiv Detail & Related papers (2022-07-02T04:08:44Z)
- Single-Modal Entropy based Active Learning for Visual Question Answering [75.1682163844354]
We address Active Learning in the multi-modal setting of Visual Question Answering (VQA).
In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition.
Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks.
arXiv Detail & Related papers (2021-10-21T05:38:45Z)