MultiModalQA: Complex Question Answering over Text, Tables and Images
- URL: http://arxiv.org/abs/2104.06039v1
- Date: Tue, 13 Apr 2021 09:14:28 GMT
- Title: MultiModalQA: Complex Question Answering over Text, Tables and Images
- Authors: Alon Talmor, Ori Yoran, Amnon Catav, Dan Lahav, Yizhong Wang, Akari
Asai, Gabriel Ilharco, Hannaneh Hajishirzi, Jonathan Berant
- Abstract summary: We present MultiModalQA: a dataset that requires joint reasoning over text, tables and images.
We create MMQA using a new framework for generating complex multi-modal questions at scale.
We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions.
- Score: 52.25399438133274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When answering complex questions, people can seamlessly combine information
from visual, textual and tabular sources. While interest in models that reason
over multiple pieces of evidence has surged in recent years, there has been
relatively little work on question answering models that reason across multiple
modalities. In this paper, we present MultiModalQA(MMQA): a challenging
question answering dataset that requires joint reasoning over text, tables and
images. We create MMQA using a new framework for generating complex multi-modal
questions at scale, harvesting tables from Wikipedia, and attaching images and
text paragraphs using entities that appear in each table. We then define a
formal language that allows us to take questions that can be answered from a
single modality, and combine them to generate cross-modal questions. Last,
crowdsourcing workers take these automatically-generated questions and rephrase
them into more fluent language. We create 29,918 questions through this
procedure, and empirically demonstrate the necessity of a multi-modal multi-hop
approach to solve our task: our multi-hop model, ImplicitDecomp, achieves an
average F1 of 51.7 over cross-modal questions, substantially outperforming a
strong baseline that achieves 38.2 F1, but still lags significantly behind
human performance, which is at 90.1 F1.
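
To make the compositional setup concrete, here is a minimal sketch of how a cross-modal question could be composed from single-modality sub-questions via a bridge entity, in the spirit of the generation framework described above. The dataclasses, the compose_bridge helper, the [ENT] slot convention, and the example content are illustrative assumptions, not MMQA's actual formal language or release format.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch only: field names, the [ENT] slot, and the example
# facts are invented for exposition, not taken from the MMQA release.

@dataclass
class SingleModalityQuestion:
    modality: str   # "table", "text", or "image"
    template: str   # question text, optionally with an [ENT] bridge slot
    answer: str     # gold answer for this hop

@dataclass
class CrossModalQuestion:
    hops: List[SingleModalityQuestion]
    composed_question: str
    answer: str

def compose_bridge(first: SingleModalityQuestion,
                   second: SingleModalityQuestion) -> CrossModalQuestion:
    """Bridge composition: the first hop's question fills the second
    hop's entity slot, so answering requires both modalities."""
    composed = second.template.replace("[ENT]", first.template.rstrip("?"))
    return CrossModalQuestion(
        hops=[first, second],
        composed_question=composed,
        answer=second.answer,  # the final answer comes from the last hop
    )

# Example: a table hop followed by an image hop (invented content).
table_q = SingleModalityQuestion(
    modality="table",
    template="which film released in 2010 won the Academy Award for Best Picture?",
    answer="The King's Speech",
)
image_q = SingleModalityQuestion(
    modality="image",
    template="what object appears on the poster of [ENT]?",
    answer="a microphone",
)
print(compose_bridge(table_q, image_q).composed_question)
```

Questions composed this way read stiffly by construction; in MMQA, crowd workers then rephrase the automatically generated questions into more fluent language.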
Related papers
- TANQ: An open domain dataset of table answered questions [15.323690523538572]
TANQ is the first open domain question answering dataset where the answers require building tables from information across multiple sources.
We release the full source attribution for every cell in the resulting table and benchmark state-of-the-art language models in open, oracle, and closed book setups.
Our best-performing baseline, GPT4, reaches an overall F1 score of 29.1, lagging behind human performance by 19.7 points.
arXiv Detail & Related papers (2024-05-13T14:07:20Z)
- Improving Question Generation with Multi-level Content Planning [70.37285816596527]
This paper addresses the problem of generating questions from a given context and an answer, specifically focusing on questions that require multi-hop reasoning across an extended context.
We propose MultiFactor, a novel QG framework based on multi-level content planning. Specifically, MultiFactor includes two components: FA-model, which simultaneously selects key phrases and generates full answers, and Q-model, which takes the generated full answer as an additional input to generate questions.
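
As a rough illustration of this two-stage design, the sketch below shows a full answer flowing from a stage-1 generator into a stage-2 question generator, assuming a generic generate(prompt) stand-in for the trained models; the prompts and interface are assumptions for illustration, not MultiFactor's actual implementation.

```python
# Hypothetical MultiFactor-style pipeline: stage 1 expands the short
# answer into a full answer sentence, stage 2 generates the question
# conditioned on that full answer (prompts are illustrative only).

def generate(prompt: str) -> str:
    """Placeholder for any trained text generator."""
    return "<generated text>"  # stub so the sketch runs end to end

def two_stage_question_generation(context: str, answer: str) -> str:
    # Stage 1 (FA-model role): select key phrases from the context and
    # expand the short answer into a full-sentence answer.
    full_answer = generate(
        f"Context: {context}\nShort answer: {answer}\n"
        "Write a full-sentence answer using key phrases from the context:"
    )
    # Stage 2 (Q-model role): generate the question, taking the stage-1
    # full answer as an additional input alongside context and answer.
    return generate(
        f"Context: {context}\nAnswer: {answer}\nFull answer: {full_answer}\n"
        "Write a question that requires multi-hop reasoning over the context:"
    )
```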
arXiv Detail & Related papers (2023-10-20T13:57:01Z)
- Successive Prompting for Decomposing Complex Questions [50.00659445976735]
Recent works leverage the capabilities of large language models (LMs) to perform complex question answering in a few-shot setting.
We introduce "Successive Prompting", where we iteratively break down a complex task into a simple task, solve it, and then repeat the process until we get the final solution.
Our best model (with successive prompting) achieves an improvement of 5% absolute F1 on a few-shot version of the DROP dataset.
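
A minimal sketch of that loop, assuming a generic ask_model(prompt) call as a stand-in for the underlying language model; the prompt wording and the FINAL: stopping convention are illustrative assumptions rather than the paper's exact protocol.

```python
# Hypothetical successive-prompting loop: repeatedly ask for the next
# simple sub-question, answer it, and fold the QA pair back into the
# context until the model signals a final answer.

def ask_model(prompt: str) -> str:
    """Placeholder for a call to a large language model."""
    return "FINAL: <answer>"  # stub so the sketch runs end to end

def successive_prompting(complex_question: str, max_steps: int = 8) -> str:
    history = f"Complex question: {complex_question}\n"
    for _ in range(max_steps):
        # Decomposition step: ask for the next simple sub-question,
        # or for the final answer if no further decomposition is needed.
        sub_q = ask_model(history + "Next sub-question (or FINAL: <answer>):")
        if sub_q.startswith("FINAL:"):
            return sub_q[len("FINAL:"):].strip()
        # Solution step: answer the sub-question in isolation.
        sub_a = ask_model(f"Question: {sub_q}\nAnswer:")
        history += f"Q: {sub_q}\nA: {sub_a}\n"
    return "no final answer within the step budget"
```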
arXiv Detail & Related papers (2022-12-08T06:03:38Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that generating lectures and explanations as a chain of thought improves question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- WebQA: Multihop and Multimodal QA [49.683300706718136]
We propose to bridge the gap between the natural language and computer vision communities with WebQA.
Our challenge is to create a unified multimodal reasoning model that seamlessly transitions and reasons regardless of the source modality.
arXiv Detail & Related papers (2021-09-01T19:43:59Z)
- FeTaQA: Free-form Table Question Answering [33.018256483762386]
We introduce FeTaQA, a new dataset of 10K Wikipedia-based (table, question, free-form answer, supporting table cells) pairs.
FeTaQA yields a more challenging table question answering setting because it requires generating free-form text answers after retrieval, inference, and integration of multiple discontinuous facts from a structured knowledge source.
arXiv Detail & Related papers (2021-04-01T09:59:40Z)
- ManyModalQA: Modality Disambiguation and QA over Diverse Inputs [73.93607719921945]
We present a new multimodal question answering challenge, ManyModalQA, in which an agent must answer a question by considering three distinct modalities.
We collect our data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.
arXiv Detail & Related papers (2020-01-22T14:39:28Z)