Few-Shot Visual Question Generation: A Novel Task and Benchmark Datasets
- URL: http://arxiv.org/abs/2210.07076v1
- Date: Thu, 13 Oct 2022 15:01:15 GMT
- Title: Few-Shot Visual Question Generation: A Novel Task and Benchmark Datasets
- Authors: Anurag Roy, David Johnson Ekka, Saptarshi Ghosh, Abir Das
- Abstract summary: We propose a new Few-Shot Visual Question Generation (FS-VQG) task and provide a comprehensive benchmark for it.
We evaluate various existing VQG approaches as well as popular few-shot solutions based on meta-learning and self-supervised strategies for the FS-VQG task.
Several important findings emerge from our experiments that shed light on the limits of current models in few-shot vision and language generation tasks.
- Score: 5.45761450227064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating natural language questions from visual scenes, known as Visual
Question Generation (VQG), has been explored in the recent past where large
amounts of meticulously labeled data provide the training corpus. However, in
practice, it is not uncommon to have only a few images with question
annotations corresponding to a few types of answers. In this paper, we propose
a new and challenging Few-Shot Visual Question Generation (FS-VQG) task and
provide a comprehensive benchmark for it. Specifically, we evaluate various
existing VQG approaches as well as popular few-shot solutions based on
meta-learning and self-supervised strategies for the FS-VQG task. We conduct
experiments on two popular existing datasets, VQG and Visual7w. In addition, we
have also cleaned and extended the VQG dataset for use in a few-shot scenario,
with additional image-question pairs as well as additional answer categories.
We call this new dataset VQG-23. Several important findings emerge from our
experiments that shed light on the limits of current models in few-shot vision
and language generation tasks. We find that trivially extending existing VQG
approaches with transfer learning or meta-learning may not be enough to tackle
the inherent challenges in few-shot VQG. We believe that this work will
contribute to accelerating the progress in few-shot learning research.
Related papers
- SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset [11.729464930866483]
"SimpsonsVQA" is a novel dataset for VQA derived from The Simpsons TV show.
It is designed to address not only the traditional VQA task but also to identify irrelevant questions related to images.
SimpsonsVQA contains approximately 23K images, 166K QA pairs, and 500K judgments.
arXiv Detail & Related papers (2024-10-30T02:30:40Z)
- From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities [2.0681376988193843]
The work presents a survey in the domain of Visual Question Answering (VQA) that delves into the intricacies of VQA datasets and methods over the field's history.
We further generalize VQA to multimodal question answering, explore tasks related to VQA, and present a set of open problems for future investigation.
arXiv Detail & Related papers (2023-11-01T05:39:41Z)
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenesQA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z)
- Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task [12.74065821307626]
VQA is an ambitious task aiming to answer any image-related question.
It is hard to build such a system once and for all, since the needs of users are continuously updated.
We propose a real-data-free replay-based method tailored for CL on VQA, named Scene Graph as Prompt for Replay.
arXiv Detail & Related papers (2022-08-24T12:00:02Z)
- From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data [27.160303686163164]
Visual question answering (VQA) for remote sensing scenes has great potential in intelligent human-computer interaction systems.
No object annotations are available in RSVQA datasets, which makes it difficult for models to exploit informative region representation.
There are questions with clearly different difficulty levels for each image in the RSVQA task.
A multi-level visual feature learning method is proposed to jointly extract language-guided holistic and regional image features.
arXiv Detail & Related papers (2022-05-06T11:37:00Z)
- K-VQG: Knowledge-aware Visual Question Generation for Common-sense Acquisition [64.55573343404572]
We present a novel knowledge-aware VQG dataset called K-VQG.
This is the first large, humanly annotated dataset in which questions regarding images are tied to structured knowledge.
We also develop a new VQG model that can encode and use knowledge as the target for a question.
arXiv Detail & Related papers (2022-03-15T13:38:10Z)
- Learning Compositional Representation for Few-shot Visual Question Answering [93.4061107793983]
Current Visual Question Answering methods perform well on answers with sufficient training data but have limited accuracy on novel answers with few examples.
We propose to extract the attributes from the answers with enough data, which are later composed to constrain the learning of the few-shot ones.
Experimental results on the VQA v2.0 validation dataset demonstrate the effectiveness of our proposed attribute network.
arXiv Detail & Related papers (2021-02-21T10:16:24Z)
- In Defense of Grid Features for Visual Question Answering [65.71985794097426]
We revisit grid features for visual question answering (VQA) and find they can work surprisingly well.
We verify that this observation holds true across different VQA models and generalizes well to other tasks like image captioning.
We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training.
arXiv Detail & Related papers (2020-01-10T18:59:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.