WebQA: Multihop and Multimodal QA
- URL: http://arxiv.org/abs/2109.00590v1
- Date: Wed, 1 Sep 2021 19:43:59 GMT
- Title: WebQA: Multihop and Multimodal QA
- Authors: Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng
Gao, Yonatan Bisk
- Abstract summary: We propose to bridge the gap between the natural language and computer vision communities with WebQA.
Our challenge is to create a unified multimodal reasoning model that seamlessly transitions and reasons regardless of the source modality.
- Score: 49.683300706718136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Web search is fundamentally multimodal and multihop. Often, even before
asking a question we choose to go directly to image search to find our answers.
Further, we rarely find an answer in a single source; instead, we aggregate
information and reason through its implications. Despite the frequency of this
everyday occurrence, at present, there is no unified question answering
benchmark that requires a single model to answer long-form natural language
questions from text and open-ended visual sources -- akin to a human's
experience. We propose to bridge this gap between the natural language and
computer vision communities with WebQA. We show that A. our multihop text
queries are difficult for a large-scale transformer model, and B. existing
multi-modal transformers and visual representations do not perform well on
open-domain visual queries. Our challenge for the community is to create a
unified multimodal reasoning model that seamlessly transitions and reasons
regardless of the source modality.
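To make the setting concrete, here is a rough, purely illustrative sketch of what one multihop, multimodal instance in such a benchmark might look like. The field names and the example content are assumptions for illustration, not the official WebQA data format:

```python
# Hypothetical illustration of a multihop, multimodal QA instance.
# Field names and content are made up for clarity; this is not the WebQA schema.
example = {
    "question": "Are the Eiffel Tower and the Tokyo Tower painted the same color?",
    "sources": [
        {"modality": "image", "id": "img_001", "caption": "The Eiffel Tower at dusk"},
        {"modality": "image", "id": "img_002", "caption": "Tokyo Tower seen from Roppongi"},
        {"modality": "text",  "id": "txt_001",
         "snippet": "The Eiffel Tower is repainted every seven years."},
    ],
    # A long-form, natural-language answer that aggregates evidence from more
    # than one source (the "multihop" aspect).
    "answer": "No; the Eiffel Tower is painted a bronze-brown shade, while Tokyo "
              "Tower is white and international orange.",
}

# A unified model would have to reason over both modalities in one pass,
# e.g. answer = model(example["question"], example["sources"]).
```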
Related papers
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent [102.31558123570437]
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs).
We propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch.
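A minimal sketch of the general idea behind self-adaptive multimodal retrieval, assuming hypothetical `plan_next_query`, `search`, and `generate_answer` callables; this illustrates the control flow only and is not OmniSearch's actual interface:

```python
# Illustrative-only sketch of an adaptive multimodal retrieval loop.
# `plan_next_query`, `search`, and `generate_answer` are hypothetical stand-ins
# for an MLLM-driven planner, a multimodal retriever, and an answer generator.
def adaptive_mrag(question, plan_next_query, search, generate_answer, max_steps=3):
    evidence = []
    for _ in range(max_steps):
        # The planner looks at the question and the evidence gathered so far
        # and decides what to retrieve next, and from which modality.
        step = plan_next_query(question, evidence)
        if step is None:          # planner decides it has enough evidence
            break
        modality, query = step
        evidence.extend(search(modality, query))
    return generate_answer(question, evidence)
```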
arXiv Detail & Related papers (2024-11-05T09:27:21Z)
- Multi-LLM QA with Embodied Exploration [55.581423861790945]
We investigate the use of Multi-Embodied LLM Explorers (MELE) for question-answering in an unknown environment.
Multiple LLM-based agents independently explore and then answer queries about a household environment.
We analyze different aggregation methods to generate a single, final answer for each query.
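One simple aggregation strategy is a normalized majority vote over the agents' answers. The sketch below is only meant to make the setting concrete; the paper analyzes several aggregation methods, and this is not presented as theirs:

```python
from collections import Counter

def aggregate_answers(agent_answers):
    """Combine independent agent answers into one final answer by majority vote.

    `agent_answers` is a list of strings, one per LLM explorer. Ties go to the
    answer encountered first. Other aggregation methods could weight answers by
    agent confidence or ask a separate LLM to adjudicate.
    """
    normalized = [a.strip().lower() for a in agent_answers]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner

# Example: three explorers answer a household query.
print(aggregate_answers(["In the kitchen", "in the kitchen", "In the garage"]))
# -> "in the kitchen"
```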
arXiv Detail & Related papers (2024-06-16T12:46:40Z)
- Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge [10.074327344317116]
We propose Q&A Prompts to equip AI models with robust cross-modality reasoning ability.
We first use the image-answer pairs and the corresponding questions in a training set as inputs and outputs to train a visual question generation model.
We then use an image tagging model to identify various instances and send packaged image-tag pairs into the visual question generation model to generate relevant questions with the extracted image tags as answers.
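In rough pseudocode, the question-generation stage described above could look like the following, with `tagger` and `vqg_model` as hypothetical stand-ins for the image tagging model and the trained visual question generation model (an illustration of the described pipeline, not the authors' code):

```python
# Illustrative sketch of the question-answer prompt mining stage.
# `tagger` and `vqg_model` are hypothetical callables: a pretrained image
# tagging model and a visual question generation (VQG) model trained on
# (image, answer) -> question pairs.
def mine_qa_prompts(image, tagger, vqg_model, max_tags=5):
    qa_prompts = []
    for tag in tagger(image)[:max_tags]:        # e.g. ["dog", "frisbee", "park"]
        # Ask the VQG model for a question whose answer is this extracted tag.
        question = vqg_model(image=image, answer=tag)
        qa_prompts.append({"question": question, "answer": tag})
    # These pairs are then packaged as prompts that surface rich visual clues
    # for the downstream VQA model.
    return qa_prompts
```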
arXiv Detail & Related papers (2024-01-19T14:22:29Z)
- TxT: Crossmodal End-to-End Learning with Transformers [84.55645255507461]
Reasoning over multiple modalities requires an alignment of semantic concepts across domains.
TxT is a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task.
Our model achieves considerable gains from end-to-end learning for multimodal question answering.
arXiv Detail & Related papers (2021-09-09T17:12:20Z)
- MultiModalQA: Complex Question Answering over Text, Tables and Images [52.25399438133274]
We present MultiModalQA: a dataset that requires joint reasoning over text, tables and images.
We create MMQA using a new framework for generating complex multi-modal questions at scale.
We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions.
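A toy illustration of that compositional idea, using made-up single-modality question templates and a bridging entity (MMQA defines a proper formal language for this; the snippet is only a sketch of the principle):

```python
# Two hypothetical single-modality questions: one answerable from a table,
# one answerable from an image.
table_question = "Which film directed by {director} grossed the most?"
image_question = "What color is the poster of {film}?"

def compose_cross_modal(director):
    # Conceptually, answering `table_question` over a table yields a film title,
    # which then fills the {film} slot of `image_question`, producing a question
    # that requires both modalities to answer.
    bridge = f"the highest-grossing film directed by {director}"
    return image_question.format(film=bridge)

print(compose_cross_modal("Christopher Nolan"))
# -> "What color is the poster of the highest-grossing film directed by Christopher Nolan?"
```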
arXiv Detail & Related papers (2021-04-13T09:14:28Z)
- ManyModalQA: Modality Disambiguation and QA over Diverse Inputs [73.93607719921945]
We present a new multimodal question answering challenge, ManyModalQA, in which an agent must answer a question by considering three distinct modalities.
We collect our data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.
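A minimal sketch of the modality-disambiguation idea, with a keyword heuristic standing in for a learned classifier (purely illustrative; not the paper's model):

```python
# Purely illustrative modality router: a learned classifier would replace the
# keyword heuristic, but the control flow is the same -- first decide which
# modality (text, image, or table) the question is about, then dispatch to the
# corresponding single-modality QA model.
def route_modality(question):
    q = question.lower()
    if any(w in q for w in ("picture", "photo", "look like", "color")):
        return "image"
    if any(w in q for w in ("how many", "rank", "which year")):
        return "table"
    return "text"

def answer(question, qa_models):
    """`qa_models` maps a modality name to a QA callable for that modality."""
    return qa_models[route_modality(question)](question)
```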
arXiv Detail & Related papers (2020-01-22T14:39:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.