Generating Natural Questions from Images for Multimodal Assistants
- URL: http://arxiv.org/abs/2012.03678v1
- Date: Tue, 17 Nov 2020 19:12:23 GMT
- Title: Generating Natural Questions from Images for Multimodal Assistants
- Authors: Alkesh Patel, Akanksha Bindal, Hadas Kotek, Christopher Klein, Jason
Williams
- Abstract summary: We present an approach for generating diverse and meaningful questions that consider both the image content and the image's metadata.
We evaluate our approach using standard evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr.
- Score: 4.930442416763205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating natural, diverse, and meaningful questions from images is an
essential task for multimodal assistants as it confirms whether they have
understood the object and scene in the images properly. Research in visual
question answering (VQA) and visual question generation (VQG) is a significant
step in this direction. However, it does not capture the questions that a
visually-abled person would ask a multimodal assistant. Recently published
datasets such as KB-VQA, FVQA, and OK-VQA aim to collect questions that require
external knowledge, which makes them more appropriate for multimodal
assistants. However, they still
contain many obvious and common-sense questions that humans would not usually
ask a digital assistant. In this paper, we provide a new benchmark dataset
that contains questions generated by human annotators, keeping in mind what
they would ask a multimodal digital assistant. Large-scale annotation of
several hundred thousand images is expensive and time-consuming, so we also
present an effective way of automatically generating questions from unseen
images. Our approach generates diverse and meaningful questions that consider
both the image content and the image's metadata (e.g., location, associated
keywords). We evaluate the approach using standard evaluation metrics such as
BLEU, METEOR, ROUGE, and CIDEr to show the relevance of the generated questions
to human-provided questions, and we measure the diversity of the generated
questions using the generative strength and inventiveness metrics. We report
new state-of-the-art results on both the public datasets and our dataset.
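As a rough sketch of the diversity metrics mentioned above, the hypothetical helper below computes generative strength and inventiveness under the definitions common in the VQG literature (average number of unique generated questions per image, and the average fraction of those that never occur in the training questions); the paper's exact formulations may differ, and all names here are illustrative.
```python
# Minimal sketch, assuming the common VQG definitions of generative strength
# and inventiveness; the paper's exact formulations may differ.
from typing import Dict, List, Set


def diversity_metrics(generated: Dict[str, List[str]],
                      train_questions: Set[str]) -> Dict[str, float]:
    strengths, novelty = [], []
    for questions in generated.values():
        unique = {q.strip().lower() for q in questions}   # de-duplicate per image
        if not unique:
            continue
        strengths.append(len(unique))                      # unique questions per image
        novelty.append(sum(q not in train_questions for q in unique) / len(unique))
    n = max(len(strengths), 1)
    return {"generative_strength": sum(strengths) / n,
            "inventiveness": sum(novelty) / n}


# Toy usage (training questions assumed normalized the same way):
print(diversity_metrics(
    {"img1": ["Where was this taken?", "where was this taken?", "Who painted this?"]},
    train_questions={"where was this taken?"}))
# -> {'generative_strength': 2.0, 'inventiveness': 0.5}
```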
Related papers
- Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge [10.074327344317116]
We propose Q&A Prompts to equip AI models with robust cross-modality reasoning ability.
We first use the image-answer pairs and the corresponding questions in a training set as inputs and outputs to train a visual question generation model.
We then use an image tagging model to identify various instances and send packaged image-tag pairs into the visual question generation model to generate relevant questions with the extracted image tags as answers.
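A minimal sketch of this tag-conditioned generation loop is shown below; the ImageTagger and AnswerConditionedVQG classes are hypothetical stand-ins for the trained models, not the authors' code.
```python
# Illustrative sketch: an image tagger proposes instance tags, and an
# answer-conditioned VQG model generates a question per tag so that the tag
# serves as the answer. Both classes are hypothetical stubs.
from dataclasses import dataclass
from typing import Any, List


@dataclass
class QAPrompt:
    question: str
    answer: str  # the detected image tag used as the target answer


class ImageTagger:
    def tag(self, image: Any) -> List[str]:
        return ["violin", "orchestra"]  # stub: a real tagger detects instances


class AnswerConditionedVQG:
    def generate(self, image: Any, answer: str) -> str:
        # Stub: a real model trained on (image, answer) -> question pairs
        # would produce a natural question whose answer is `answer`.
        return f"<question whose answer should be '{answer}'>"


def build_qa_prompts(image: Any, tagger: ImageTagger,
                     vqg: AnswerConditionedVQG) -> List[QAPrompt]:
    """Generate one question-answer prompt per detected image tag."""
    return [QAPrompt(vqg.generate(image, answer=t), t) for t in tagger.tag(image)]


print(build_qa_prompts(image=None, tagger=ImageTagger(), vqg=AnswerConditionedVQG()))
```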
arXiv Detail & Related papers (2024-01-19T14:22:29Z)
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering such questions often requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
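As a purely illustrative example of such perturbations (the paper's actual strategies may differ), the sketch below blurs the image or swaps one word of the question for an unrelated token so the pair becomes hard to answer.
```python
# Illustrative perturbations only; not the UNK-VQA authors' implementation.
import random

from PIL import Image, ImageFilter


def perturb_question(question: str, rng: random.Random) -> str:
    """Swap one word for an unrelated token so the question may no longer fit the image."""
    unrelated = ["zebra", "submarine", "cathedral"]  # illustrative vocabulary
    words = question.split()
    words[rng.randrange(len(words))] = rng.choice(unrelated)
    return " ".join(words)


def perturb_image(image: Image.Image) -> Image.Image:
    """Blur the image heavily so the original answer becomes hard to recover."""
    return image.filter(ImageFilter.GaussianBlur(radius=8))


print(perturb_question("What color is the car?", random.Random(0)))
```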
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese [2.7528170226206443]
We introduce the OpenViVQA dataset, the first large-scale dataset for visual question answering in Vietnamese.
The dataset consists of 11,000+ images associated with 37,000+ question-answer pairs (QAs).
Our proposed methods achieve results that are competitive with SOTA models such as SAAA, MCAN, LoRRA, and M4C.
arXiv Detail & Related papers (2023-05-07T03:59:31Z)
- ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding [42.5118058527339]
ChiQA contains more than 40K questions and more than 200K question-image pairs.
ChiQA requires a deep understanding of both language and vision, including grounding, comparisons, and reading.
We evaluate several state-of-the-art visual-language models such as ALBEF, demonstrating that there is still large room for improvement on ChiQA.
arXiv Detail & Related papers (2022-08-05T07:55:28Z)
- MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding [131.8797942031366]
We present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text.
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
We introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task.
arXiv Detail & Related papers (2021-12-20T18:23:30Z)
- MMIU: Dataset for Visual Intent Understanding in Multimodal Assistants [4.322454918650574]
We provide a new dataset, MMIU (MultiModal Intent Understanding), that contains questions and corresponding intents provided by human annotators while looking at images.
We then use this dataset for the intent classification task in a multimodal digital assistant.
arXiv Detail & Related papers (2021-10-13T00:57:05Z)
- Visual Question Rewriting for Increasing Response Rate [12.700769102964088]
We explore how to automatically rewrite natural language questions to improve the response rate from people.
A new task of Visual Question Rewriting (VQR) is introduced to explore how visual information can be used to improve the new questions.
arXiv Detail & Related papers (2021-06-04T04:46:47Z)
- MultiModalQA: Complex Question Answering over Text, Tables and Images [52.25399438133274]
We present MultiModalQA: a dataset that requires joint reasoning over text, tables and images.
We create MMQA using a new framework for generating complex multi-modal questions at scale.
We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions.
arXiv Detail & Related papers (2021-04-13T09:14:28Z)
- Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document.
We show that readers engage in a series of pragmatic strategies to seek information.
We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z)
- Visual Question Answering on Image Sets [70.4472272672716]
We introduce the task of Image-Set Visual Question Answering (ISVQA), which generalizes the commonly studied single-image VQA problem to multi-image settings.
Taking a natural language question and a set of images as input, it aims to answer the question based on the content of the images.
The questions can be about objects and relationships in one or more images or about the entire scene depicted by the image set.
arXiv Detail & Related papers (2020-08-27T08:03:32Z)