Multimodal Inverse Cloze Task for Knowledge-based Visual Question
Answering
- URL: http://arxiv.org/abs/2301.04366v1
- Date: Wed, 11 Jan 2023 09:16:34 GMT
- Title: Multimodal Inverse Cloze Task for Knowledge-based Visual Question
Answering
- Authors: Paul Lerner, Olivier Ferret, Camille Guinaudeau
- Abstract summary: We present a new pre-training method, Multimodal Inverse Cloze Task, for Knowledge-based Visual Question Answering about named Entities.
KVQAE is a recently introduced task that consists of answering questions about named entities grounded in a visual context using a Knowledge Base.
Our method is applicable to different neural network architectures and leads to a 9% relative-MRR and 15% relative-F1 gain for retrieval and reading comprehension, respectively.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new pre-training method, Multimodal Inverse Cloze Task, for
Knowledge-based Visual Question Answering about named Entities (KVQAE). KVQAE
is a recently introduced task that consists of answering questions about named
entities grounded in a visual context using a Knowledge Base. Therefore, the
interaction between the modalities is paramount to retrieve information and
must be captured with complex fusion models. As these models require a lot of
training data, we design this pre-training task from existing work in textual
Question Answering. It consists of treating a sentence as a pseudo-question and
its context as a pseudo-relevant passage, and we extend it by pairing texts
with nearby images in multimodal documents. Our method is applicable to
different neural network architectures and leads to a 9% relative-MRR and 15%
relative-F1 gain for retrieval and reading comprehension, respectively, over a
no-pre-training baseline.
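The pseudo-pair construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, field names (`sentences`, `images`), and the heuristic of keeping the sampled sentence in its passage with a small probability are assumptions of this sketch.

```python
import random

def make_ict_pairs(paragraphs, keep_sentence_prob=0.1):
    """Build Inverse Cloze Task pseudo-pairs from multimodal documents.

    Each paragraph is a dict holding its sentences and any images appearing
    near the text (field names are assumptions for this sketch). A randomly
    sampled sentence plays the pseudo-question; the remaining context plays
    the pseudo-relevant passage. The sentence is occasionally kept in the
    passage so the model does not only learn exact-match retrieval.
    """
    pairs = []
    for para in paragraphs:
        sentences = para["sentences"]
        if len(sentences) < 2:
            continue
        idx = random.randrange(len(sentences))
        pseudo_question = sentences[idx]
        if random.random() < keep_sentence_prob:
            passage_sents = sentences  # keep the sampled sentence
        else:
            passage_sents = sentences[:idx] + sentences[idx + 1:]
        nearby_image = para.get("images", [None])[0]
        pairs.append({
            "question_text": pseudo_question,
            "question_image": nearby_image,
            "passage_text": " ".join(passage_sents),
            "passage_image": nearby_image,
        })
    return pairs

docs = [{"sentences": ["Paris is the capital of France.",
                       "It hosted the 1900 Summer Olympics.",
                       "The Eiffel Tower was built in 1889."],
         "images": ["paris.jpg"]}]
pairs = make_ict_pairs(docs)
```

A multimodal retriever would then be pre-trained to match each pseudo-question (text plus image) to its pseudo-relevant passage, before fine-tuning on KVQAE data.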
Related papers
- Distribution Matching for Multi-Task Learning of Classification Tasks: a Large-Scale Study on Faces & Beyond [62.406687088097605]
Multi-Task Learning (MTL) is a framework, where multiple related tasks are learned jointly and benefit from a shared representation space.
We show that MTL can be successful on classification tasks with little or even non-overlapping annotation.
We propose a novel approach, where knowledge exchange is enabled between the tasks via distribution matching.
arXiv Detail & Related papers (2024-01-02T14:18:11Z)
- Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation [1.8591405259852054]
Visual Word Sense Disambiguation (VWSD) is a novel, challenging task whose goal is to retrieve, from a set of candidates, the image that best matches an ambiguous word in a given context.
In this paper, we make a substantial step towards unveiling this interesting task by applying a varying set of approaches.
arXiv Detail & Related papers (2023-10-21T14:35:42Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- UniKGQA: Unified Retrieval and Reasoning for Solving Multi-hop Question Answering Over Knowledge Graph [89.98762327725112]
Multi-hop Question Answering over Knowledge Graph (KGQA) aims to find the answer entities that are multiple hops away from the topic entities mentioned in a natural language question.
We propose UniKGQA, a novel approach for multi-hop KGQA task, by unifying retrieval and reasoning in both model architecture and parameter learning.
arXiv Detail & Related papers (2022-12-02T04:08:09Z)
- Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate few-shot task generalization as a reinforcement learning problem where a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to achieve learning better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
- MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding [131.8797942031366]
We present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text.
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
We introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task.
arXiv Detail & Related papers (2021-12-20T18:23:30Z)
- MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided Multimodal Attention for Textbook Question Answering [7.367945534481411]
We propose a novel model named MoCA, which incorporates multi-stage domain pretraining and multimodal cross attention for the Textbook Question Answering task.
The experimental results show the superiority of our model, which outperforms state-of-the-art methods by 2.21% and 2.43% on the validation and test splits, respectively.
arXiv Detail & Related papers (2021-12-06T07:58:53Z)
- Multi-Task Learning with Deep Neural Networks: A Survey [0.0]
Multi-task learning (MTL) is a subfield of machine learning in which multiple tasks are simultaneously learned by a shared model.
We give an overview of multi-task learning methods for deep neural networks, with the aim of summarizing both the well-established and most recent directions within the field.
arXiv Detail & Related papers (2020-09-10T19:31:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed (including all information) and is not responsible for any consequences of its use.