MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media
Knowledge Extraction and Grounding
- URL: http://arxiv.org/abs/2112.10728v1
- Date: Mon, 20 Dec 2021 18:23:30 GMT
- Title: MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media
Knowledge Extraction and Grounding
- Authors: Revanth Gangi Reddy, Xilin Rui, Manling Li, Xudong Lin, Haoyang Wen,
Jaemin Cho, Lifu Huang, Mohit Bansal, Avirup Sil, Shih-Fu Chang, Alexander
Schwing, Heng Ji
- Abstract summary: We present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text.
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
We introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task.
- Score: 131.8797942031366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, there has been an increasing interest in building question
answering (QA) models that reason across multiple modalities, such as text and
images. However, QA using images is often limited to just picking the answer
from a pre-defined set of options. In addition, images in the real world,
especially in news, have objects that are co-referential to the text, with
complementary information from both modalities. In this paper, we present a new
QA evaluation benchmark with 1,384 questions over news articles that require
cross-media grounding of objects in images onto text. Specifically, the task
involves multi-hop questions that require reasoning over image-caption pairs to
identify the grounded visual object being referred to and then predicting a
span from the news body text to answer the question. In addition, we introduce
a novel multimedia data augmentation framework, based on cross-media knowledge
extraction and synthetic question-answer generation, to automatically augment
data that can provide weak supervision for this task. We evaluate both
pipeline-based and end-to-end pretraining-based multimedia QA models on our
benchmark, and show that they achieve promising performance while still lagging
considerably behind human performance, leaving large room for future work on
this challenging new task.
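The two-hop structure described above suggests a simple pipeline-style baseline: first ground the visual description in the question to a named entity using the image and its caption, then run an extractive reader over the news body text with the grounded entity substituted in. The sketch below is a minimal illustration of that structure under stated assumptions, not the authors' system; the checkpoint names, the whole-image CLIP scoring heuristic, and the string-substitution step are all placeholders.

```python
# Minimal sketch of a pipeline-style baseline for a two-hop multimedia QA task.
# NOT the authors' system: the checkpoints, whole-image CLIP scoring, and the
# simple string substitution are illustrative assumptions only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, pipeline

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")

def ground_bridge_entity(image, visual_mention, caption_entities):
    """Hop 1: pick the caption entity that best matches the object described
    in the question, scored against the image with CLIP. A real system would
    use region-level grounding rather than whole-image similarity."""
    texts = [f"a news photo of {name}, {visual_mention}" for name in caption_entities]
    inputs = clip_proc(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = clip(**inputs).logits_per_image[0]  # image-to-text similarities
    return caption_entities[int(scores.argmax())]

def answer_question(image, question, visual_mention, caption_entities, body_text):
    """Hop 2: substitute the grounded entity into the question, then extract
    an answer span from the news body text."""
    entity = ground_bridge_entity(image, visual_mention, caption_entities)
    rewritten = question.replace(visual_mention, entity)
    return reader(question=rewritten, context=body_text)["answer"]

# Hypothetical usage (all strings are made up for illustration):
# answer_question(Image.open("news.jpg"),
#                 "What organization does the person holding the microphone lead?",
#                 "the person holding the microphone",
#                 ["Jane Doe", "John Smith"],
#                 article_body_text)
```

A stronger pipeline would use region-level visual grounding and a reader adapted to news text, but the shape of the two hops stays the same.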
Related papers
- RTQ: Rethinking Video-language Understanding Based on Image-text Model [55.278942477715084]
Video-language understanding presents unique challenges due to the inclusion of highly complex semantic details.
We propose a novel framework called RTQ, which addresses these challenges simultaneously.
Our model demonstrates outstanding performance even in the absence of video-language pre-training.
arXiv Detail & Related papers (2023-12-01T04:51:01Z)
- MultiQG-TI: Towards Question Generation from Multi-modal Sources [4.913248451323163]
We study the new problem of automatic question generation from multi-modal sources containing images and texts.
We propose a simple solution for our new problem, called MultiQG-TI, which enables a text-only question generator to process visual input.
We demonstrate that MultiQG-TI significantly outperforms few-shot-prompted ChatGPT, despite having hundreds of times fewer trainable parameters.
arXiv Detail & Related papers (2023-07-07T08:14:15Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning [21.714382546678053]
We present a new challenge with a dataset that contains 23,781 questions based on 10,124 image-text pairs.
The aim of this challenge is to develop and benchmark models that are capable of multimedia entity alignment, multi-step reasoning and open-ended answer generation.
arXiv Detail & Related papers (2023-03-05T10:32:26Z)
- Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering [4.114444605090133]
We present a new pre-training method, Multimodal Inverse Cloze Task, for Knowledge-based Visual Question Answering about named Entities.
KVQAE (Knowledge-based Visual Question Answering about named Entities) is a recently introduced task that consists of answering questions about named entities grounded in a visual context, using a knowledge base.
Our method is applicable to different neural network architectures and leads to a 9% relative-MRR and 15% relative-F1 gain for retrieval and reading comprehension.
arXiv Detail & Related papers (2023-01-11T09:16:34Z)
- SPRING: Situated Conversation Agent Pretrained with Multimodal Questions from Incremental Layout Graph [16.275155481031348]
We propose SPRING, a Situated Conversation Agent Pretrained with Multimodal Questions from an Incremental Layout Graph.
All QA pairs utilized during pretraining are generated from novel Incremental Layout Graphs (ILGs).
Experimental results verify SPRING's effectiveness, showing that it significantly outperforms state-of-the-art approaches on both the SIMMC 1.0 and SIMMC 2.0 datasets.
arXiv Detail & Related papers (2023-01-05T08:03:47Z)
- MultiModalQA: Complex Question Answering over Text, Tables and Images [52.25399438133274]
We present MultiModalQA: a dataset that requires joint reasoning over text, tables and images.
We create MMQA using a new framework for generating complex multi-modal questions at scale.
We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions.
arXiv Detail & Related papers (2021-04-13T09:14:28Z)
- Tasks Integrated Networks: Joint Detection and Retrieval for Image Search [99.49021025124405]
In many real-world search scenarios (e.g., video surveillance), objects are seldom accurately detected or annotated.
We first introduce an end-to-end Integrated Net (I-Net), which has three merits.
We further propose an improved I-Net, called DC-I-Net, which makes two new contributions.
arXiv Detail & Related papers (2020-09-03T03:57:50Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level) and multi-head self-cross-integration across sources (video and dense captions), with gates that pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.