CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question
Answering with Hypothetical Actions over Images
- URL: http://arxiv.org/abs/2104.05981v1
- Date: Tue, 13 Apr 2021 07:29:21 GMT
- Authors: Shailaja Keyur Sampat, Akshay Kumar, Yezhou Yang and Chitta Baral
- Abstract summary: We take visual understanding to a higher level where systems are challenged to answer questions that involve mentally simulating the hypothetical consequences of performing specific actions in a given scenario.
We formulate a vision-language question answering task based on the CLEVR dataset.
- Score: 31.317663183139384
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing research on visual question answering (VQA) is limited to
information explicitly present in an image or a video. In this paper, we take
visual understanding to a higher level where systems are challenged to answer
questions that involve mentally simulating the hypothetical consequences of
performing specific actions in a given scenario. Towards that end, we formulate
a vision-language question answering task based on the CLEVR (Johnson et al.,
2017) dataset. We then modify the best existing VQA methods and propose
baseline solvers for this task. Finally, we motivate the development of better
vision-language models by providing insights about the capability of diverse
architectures to perform joint reasoning over image and text modalities. Our dataset
setup scripts and code will be made publicly available at
https://github.com/shailaja183/clevr_hyp.
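For illustration, a CLEVR_HYP-style instance can be thought of as a CLEVR image paired with a hypothetical action described in text and a question about the scene after that action is mentally applied. The Python sketch below is a minimal, hypothetical example; the field names ("image", "action_text", "question", "answer") and the JSON layout are assumptions made for illustration only, not the dataset's actual schema (see the repository above for the real format).

    import json

    # Hypothetical example of one instance: a CLEVR image, a textual
    # action to simulate, and a question about the resulting scene.
    # Field names and layout are assumptions, not the official schema.
    example = {
        "image": "CLEVR_train_000123.png",
        "action_text": "Remove all large red cubes from the scene.",
        "question": "How many cubes are left?",
        "answer": "2",
    }

    def load_examples(path):
        """Load a list of such examples from a JSON file."""
        with open(path, "r") as f:
            return json.load(f)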
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts [3.6064695344878093]
Visual question answering (VQA) is regarded as an AI-complete task, as it requires understanding, reasoning, and inference over both visual and language content.
This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
arXiv Detail & Related papers (2024-04-12T16:35:23Z)
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering such questions often requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z)
- A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
arXiv Detail & Related papers (2022-06-03T17:52:27Z)
- Grounding Answers for Visual Questions Asked by Visually Impaired People [16.978747012406266]
VizWiz-VQA-Grounding is the first dataset that visually grounds answers to visual questions asked by people with visual impairments.
We analyze our dataset and compare it with five VQA-Grounding datasets to demonstrate what makes it similar and different.
arXiv Detail & Related papers (2022-02-04T06:47:16Z)
- CapWAP: Captioning with a Purpose [56.99405135645775]
We propose a new task, Captioning with a Purpose (CapWAP).
Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population.
We show that it is possible to use reinforcement learning to directly optimize for the intended information need.
arXiv Detail & Related papers (2020-11-09T09:23:55Z)
- Visuo-Linguistic Question Answering (VLQA) Challenge [47.54738740910987]
We propose a novel task that requires deriving joint inference over a given image-text modality.
We compile the Visuo-Linguistic Question Answering (VLQA) challenge corpus in a question answering setting.
arXiv Detail & Related papers (2020-05-01T12:18:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.