WebQA: Multihop and Multimodal QA
- URL: http://arxiv.org/abs/2109.00590v1
- Date: Wed, 1 Sep 2021 19:43:59 GMT
- Title: WebQA: Multihop and Multimodal QA
- Authors: Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng
Gao, Yonatan Bisk
- Abstract summary: We propose to bridge the gap between the natural language and computer vision communities with WebQA.
Our challenge is to create a unified multimodal reasoning model that seamlessly transitions and reasons regardless of the source modality.
- Score: 49.683300706718136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Web search is fundamentally multimodal and multihop. Often, even before
asking a question we choose to go directly to image search to find our answers.
Further, rarely do we find an answer from a single source but aggregate
information and reason through implications. Despite the frequency of this
everyday occurrence, at present, there is no unified question answering
benchmark that requires a single model to answer long-form natural language
questions from text and open-ended visual sources -- akin to a human's
experience. We propose to bridge this gap between the natural language and
computer vision communities with WebQA. We show that A. our multihop text
queries are difficult for a large-scale transformer model, and B. existing
multi-modal transformers and visual representations do not perform well on
open-domain visual queries. Our challenge for the community is to create a
unified multimodal reasoning model that seamlessly transitions and reasons
regardless of the source modality.
Related papers
- Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge [6.939439327682393]
We propose Q&A Prompts to equip AI models with robust cross-modality reasoning ability.
We first use the image-answer pairs and the corresponding questions in a training set as inputs and outputs to train a visual question generation model.
We then use an image tagging model to identify various instances and send packaged image-tag pairs into the visual question generation model to generate relevant questions with the extracted image tags as answers.
arXiv Detail & Related papers (2024-01-19T14:22:29Z) - Improving Question Generation with Multi-level Content Planning [70.37285816596527]
This paper addresses the problem of generating questions from a given context and an answer, specifically focusing on questions that require multi-hop reasoning across an extended context.
We propose MultiFactor, a novel QG framework based on multi-level content planning. Specifically, MultiFactor includes two components: FA-model, which simultaneously selects key phrases and generates full answers, and Q-model which takes the generated full answer as an additional input to generate questions.
arXiv Detail & Related papers (2023-10-20T13:57:01Z) - FashionVQA: A Domain-Specific Visual Question Answering System [2.6924405243296134]
We train a visual question answering (VQA) system to answer complex natural language questions about apparel in fashion photoshoot images.
The accuracy of the best model surpasses the human expert level, even when answering human-generated questions.
Our approach for generating a large-scale multimodal domain-specific dataset provides a path for training specialized models capable of communicating in natural language.
arXiv Detail & Related papers (2022-08-24T01:18:13Z) - Delving Deeper into Cross-lingual Visual Question Answering [115.16614806717341]
We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance.
We analyze cross-lingual VQA across different question types of varying complexity for different multilingual multimodal Transformers.
arXiv Detail & Related papers (2022-02-15T18:22:18Z) - TxT: Crossmodal End-to-End Learning with Transformers [84.55645255507461]
Reasoning over multiple modalities requires an alignment of semantic concepts across domains.
TxT is a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task.
Our model achieves considerable gains from end-to-end learning for multimodal question answering.
arXiv Detail & Related papers (2021-09-09T17:12:20Z) - MultiModalQA: Complex Question Answering over Text, Tables and Images [52.25399438133274]
We present MultiModalQA: a dataset that requires joint reasoning over text, tables and images.
We create MMQA using a new framework for generating complex multi-modal questions at scale.
We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions.
arXiv Detail & Related papers (2021-04-13T09:14:28Z) - ManyModalQA: Modality Disambiguation and QA over Diverse Inputs [73.93607719921945]
We present a new multimodal question answering challenge, ManyModalQA, in which an agent must answer a question by considering three distinct modalities.
We collect our data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.
arXiv Detail & Related papers (2020-01-22T14:39:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.