ChiQA: A Large Scale Image-based Real-World Question Answering Dataset
for Multi-Modal Understanding
- URL: http://arxiv.org/abs/2208.03030v1
- Date: Fri, 5 Aug 2022 07:55:28 GMT
- Title: ChiQA: A Large Scale Image-based Real-World Question Answering Dataset
for Multi-Modal Understanding
- Authors: Bingning Wang, Feiyang Lv, Ting Yao, Yiming Yuan, Jin Ma, Yu Luo and
Haijin Liang
- Abstract summary: ChiQA contains more than 40K questions and more than 200K question-image pairs.
ChiQA requires a deep understanding of both language and vision, including grounding, comparisons, and reading.
We evaluate several state-of-the-art vision-language models such as ALBEF, demonstrating that there is still large room for improvement on ChiQA.
- Score: 42.5118058527339
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual question answering is an important task in both natural language and
vision understanding. However, in most public visual question answering
datasets, such as VQA and CLEVR, the questions are human-generated and specific to
the given image, such as `What color are her eyes?'. These crowdsourced
questions are relatively simple and sometimes biased toward certain entities or
attributes. In this paper, we introduce a new image-based question answering
dataset, ChiQA. It contains real-world queries issued by internet users,
combined with several related open-domain images. The system should determine
whether an image can answer the question or not. Unlike previous VQA datasets,
the questions are real-world, image-independent queries that are more diverse
and less biased. Compared with previous image-retrieval or image-captioning
datasets, ChiQA measures not only relatedness but also answerability, which
demands more fine-grained vision and language reasoning. ChiQA contains more
than 40K questions and more than 200K question-image pairs. A three-level 2/1/0
label is assigned to each pair, indicating a perfect answer, a partial answer,
or irrelevance. Data analysis shows that ChiQA requires a deep understanding of
both language and vision, including grounding, comparison, and reading. We
evaluate several state-of-the-art vision-language models such as ALBEF,
demonstrating that there is still large room for improvement on ChiQA.
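To make the annotation scheme concrete, below is a minimal sketch of how ChiQA-style question-image pairs with 2/1/0 answerability labels might be loaded and summarized. The JSONL layout and the field names ("question", "image_url", "label") are assumptions for illustration only; the abstract does not specify the released file format.

```python
# Minimal sketch of handling ChiQA-style data. The schema below is a
# hypothetical illustration, not the dataset's actual release format.
import json
from collections import Counter

def load_pairs(path):
    """Yield question-image pairs, each carrying a 2/1/0 answerability label
    (2 = perfect answer, 1 = partial answer, 0 = irrelevant)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def label_distribution(pairs):
    """Count how many pairs fall into each answerability level."""
    return Counter(pair["label"] for pair in pairs)

if __name__ == "__main__":
    # "chiqa_train.jsonl" is a placeholder file name for this sketch.
    pairs = list(load_pairs("chiqa_train.jsonl"))
    print(label_distribution(pairs))  # e.g. Counter({0: ..., 1: ..., 2: ...})
```

A model evaluated on such data would be scored per question-image pair against these three levels, which is what distinguishes the answerability task from plain image-text relevance ranking.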
Related papers
- Language Guided Visual Question Answering: Elevate Your Multimodal
Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z) - UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z) - Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in the dataset are more focused on the text present in the image.
The models trained on this dataset predict biased answers due to the lack of understanding of visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z) - Context-VQA: Towards Context-Aware and Purposeful Visual Question
Answering [17.675630617265288]
Visual question answering (VQA) has the potential to make the Internet more accessible in an interactive way.
People who are blind or have low vision prefer image explanations that incorporate the context in which an image appears.
We argue that VQA models will not fully succeed at meeting people's needs unless they take context into account.
arXiv Detail & Related papers (2023-07-28T18:01:08Z) - A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
arXiv Detail & Related papers (2022-06-03T17:52:27Z) - Grounding Answers for Visual Questions Asked by Visually Impaired People [16.978747012406266]
VizWiz-VQA-Grounding is the first dataset that visually grounds answers to visual questions asked by people with visual impairments.
We analyze our dataset and compare it with five VQA-Grounding datasets to demonstrate what makes it similar and different.
arXiv Detail & Related papers (2022-02-04T06:47:16Z) - Knowledge-Routed Visual Question Reasoning: Challenges for Deep
Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z) - CapWAP: Captioning with a Purpose [56.99405135645775]
We propose a new task, Captioning with a Purpose (CapWAP).
Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population.
We show that it is possible to use reinforcement learning to directly optimize for the intended information need.
arXiv Detail & Related papers (2020-11-09T09:23:55Z) - Visual Question Answering on Image Sets [70.4472272672716]
We introduce the task of Image-Set Visual Question Answering (ISVQA), which generalizes the commonly studied single-image VQA problem to multi-image settings.
Taking a natural language question and a set of images as input, it aims to answer the question based on the content of the images.
The questions can be about objects and relationships in one or more images or about the entire scene depicted by the image set.
arXiv Detail & Related papers (2020-08-27T08:03:32Z)