Encyclopedic VQA: Visual questions about detailed properties of
fine-grained categories
- URL: http://arxiv.org/abs/2306.09224v2
- Date: Mon, 24 Jul 2023 15:05:55 GMT
- Title: Encyclopedic VQA: Visual questions about detailed properties of
fine-grained categories
- Authors: Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe
Cadar, Howard Zhou, Fei Sha, André Araujo, Vittorio Ferrari
- Abstract summary: Encyclopedic-VQA is a large-scale visual question answering dataset.
It contains 221k unique question+answer pairs each matched with (up to) 5 images.
Our dataset comes with a controlled knowledge base derived from Wikipedia.
- Score: 41.2406955639537
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose Encyclopedic-VQA, a large-scale visual question answering (VQA)
dataset featuring visual questions about detailed properties of fine-grained
categories and instances. It contains 221k unique question+answer pairs each
matched with (up to) 5 images, resulting in a total of 1M VQA samples.
Moreover, our dataset comes with a controlled knowledge base derived from
Wikipedia, marking the evidence to support each answer. Empirically, we show
that our dataset poses a hard challenge for large vision+language models as
they perform poorly on our dataset: PaLI [14] is state-of-the-art on OK-VQA
[37], yet it only achieves 13.0% accuracy on our dataset. Moreover, we
experimentally show that progress on answering our encyclopedic questions can
be achieved by augmenting large models with a mechanism that retrieves relevant
information from the knowledge base. An oracle experiment with perfect
retrieval achieves 87.0% accuracy on the single-hop portion of our dataset, and
an automatic retrieval-augmented prototype yields 48.8%. We believe that our
dataset enables future research on retrieval-augmented vision+language models.
It is available at
https://github.com/google-research/google-research/tree/master/encyclopedic_vqa .
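As a loose illustration of the retrieval-augmented setup described in the abstract, the sketch below (not the authors' released code) couples a toy lexical retriever over a two-entry placeholder knowledge base with a stub reader. All names here (KNOWLEDGE_BASE, lexical_score, retrieve, answer) are hypothetical; a real system would retrieve from the Wikipedia-derived knowledge base and condition a large vision+language model on both the image and the retrieved evidence.

```python
# Minimal sketch of retrieval-augmented answering over a knowledge base.
# The retriever is a crude token-overlap scorer; the reader is a stub that
# stands in for a large vision+language model.
from collections import Counter
from typing import Callable

# Hypothetical toy knowledge base: article title -> evidence passage.
KNOWLEDGE_BASE = {
    "Quercus robur": "The English oak can live for more than 1,000 years.",
    "Sequoia sempervirens": "Coast redwoods are the tallest living trees.",
}

def lexical_score(query: str, passage: str) -> float:
    """Count overlapping tokens between query and passage."""
    q, p = Counter(query.lower().split()), Counter(passage.lower().split())
    return sum((q & p).values())

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k best-matching evidence passages from the knowledge base."""
    ranked = sorted(KNOWLEDGE_BASE.values(),
                    key=lambda passage: lexical_score(question, passage),
                    reverse=True)
    return ranked[:k]

def answer(question: str, reader: Callable[[str, str], str]) -> str:
    """Retrieval-augmented answering: condition the reader on retrieved evidence."""
    evidence = " ".join(retrieve(question))
    return reader(question, evidence)

if __name__ == "__main__":
    # Stub reader; in practice this would also see the question's image(s).
    stub_reader = lambda q, ev: f"(answer conditioned on: {ev})"
    print(answer("How long can this oak tree live?", stub_reader))
```

The oracle-retrieval result in the abstract (87.0% vs. 48.8% for automatic retrieval) corresponds to replacing the retriever above with one that always returns the annotated evidence passage.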
Related papers
- EchoSight: Advancing Visual-Language Models with Wiki Knowledge [39.02148880719576]
We introduce EchoSight, a novel framework for knowledge-based Visual Question Answering.
To achieve high-performing retrieval, EchoSight first searches wiki articles using visual-only information.
Our experimental results on the Encyclopedic VQA and InfoSeek datasets demonstrate that EchoSight establishes new state-of-the-art results in knowledge-based VQA.
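A minimal sketch of this visual-only first-stage retrieval idea, under the assumption that wiki-article images and the query image are embedded by some vision encoder; the embeddings below are random placeholders and the helper visual_retrieve is hypothetical, not EchoSight's actual implementation.

```python
# Rank wiki articles by cosine similarity between a query-image embedding
# and precomputed article-image embeddings (placeholder vectors here).
import numpy as np

rng = np.random.default_rng(0)

article_titles = ["Quercus robur", "Sequoia sempervirens", "Ginkgo biloba"]
article_embs = rng.standard_normal((len(article_titles), 512))  # stand-ins
query_image_emb = rng.standard_normal(512)                      # stand-in

def visual_retrieve(query: np.ndarray, embs: np.ndarray, titles, k: int = 2):
    """Return the k article titles whose image embeddings are closest to the query."""
    sims = embs @ query / (np.linalg.norm(embs, axis=1) * np.linalg.norm(query))
    top = np.argsort(-sims)[:k]
    return [titles[i] for i in top]

print(visual_retrieve(query_image_emb, article_embs, article_titles))
```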
arXiv Detail & Related papers (2024-07-17T16:55:42Z)
- KET-QA: A Dataset for Knowledge Enhanced Table Question Answering [63.56707527868466]
We propose to use a knowledge base (KB) as the external knowledge source for TableQA.
Answering every question requires integrating information from both the table and the knowledge sub-graph.
We design a retriever-reasoner structured pipeline model to extract pertinent information from the vast knowledge sub-graph.
arXiv Detail & Related papers (2024-05-13T18:26:32Z)
- SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM [48.15067480282839]
This work introduces a novel evaluative benchmark named SnapNTell, specifically tailored for entity-centric VQA.
The dataset is organized into 22 major categories, containing 7,568 unique entities in total.
Our approach markedly outperforms existing methods on the SnapNTell dataset, achieving a 66.5% improvement in the BLEURT score.
arXiv Detail & Related papers (2024-03-07T18:38:17Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
- A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
arXiv Detail & Related papers (2022-06-03T17:52:27Z)
- Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking [30.155625852894797]
We propose a browser-based benchmarking tool for researchers and challenge organizers.
Our tool helps test generalization capabilities of models across multiple datasets.
Interactive filtering facilitates discovery of problematic behavior.
arXiv Detail & Related papers (2021-10-11T11:08:35Z)
- COVIDRead: A Large-scale Question Answering Dataset on COVID-19 [41.23094507923245]
We present COVIDRead, a SQuAD-like (Stanford Question Answering Dataset) resource with more than 100k question-answer pairs.
This resource could serve many purposes, from answering lay questions about this uncommon disease to helping journal editors and associate editors manage articles.
We establish several end-to-end neural-network baseline models, with F1 scores ranging from 32.03% to 37.19%.
arXiv Detail & Related papers (2021-10-05T07:38:06Z)
- Rapidly Bootstrapping a Question Answering Dataset for COVID-19 [88.86456834766288]
We present CovidQA, the beginnings of a question answering dataset specifically designed for COVID-19.
This is the first publicly available resource of its type, and intended as a stopgap measure for guiding research until more substantial evaluation resources become available.
arXiv Detail & Related papers (2020-04-23T17:35:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.