IconQA: A New Benchmark for Abstract Diagram Understanding and Visual
Language Reasoning
- URL: http://arxiv.org/abs/2110.13214v1
- Date: Mon, 25 Oct 2021 18:52:26 GMT
- Title: IconQA: A New Benchmark for Abstract Diagram Understanding and Visual
Language Reasoning
- Authors: Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou
Yu, Xiaodan Liang, Song-Chun Zhu
- Abstract summary: We introduce a new challenge of Icon Question Answering (IconQA) with the goal of answering a question in an icon image context.
We release IconQA, a large-scale dataset that consists of 107,439 questions and three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank.
We further release Icon645, an icon dataset containing 645,687 colored icons across 377 classes.
- Score: 132.49090098391258
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current visual question answering (VQA) tasks mainly consider answering
human-annotated questions for natural images. However, aside from natural
images, abstract diagrams with semantic richness are still understudied in
visual understanding and reasoning research. In this work, we introduce a new
challenge of Icon Question Answering (IconQA) with the goal of answering a
question in an icon image context. We release IconQA, a large-scale dataset
that consists of 107,439 questions and three sub-tasks: multi-image-choice,
multi-text-choice, and filling-in-the-blank. The IconQA dataset is inspired by
real-world diagram word problems that highlight the importance of abstract
diagram understanding and comprehensive cognitive reasoning. Thus, IconQA
requires not only perception skills like object recognition and text
understanding, but also diverse cognitive reasoning skills, such as geometric
reasoning, commonsense reasoning, and arithmetic reasoning. To help potential
IconQA models learn semantic representations of icon images, we further
release Icon645, an icon dataset of 645,687 colored icons across 377 classes.
We conduct extensive user studies and blind experiments and
reproduce a wide range of advanced VQA methods to benchmark the IconQA task.
We also develop a strong IconQA baseline, Patch-TRM, which applies a pyramid
cross-modal Transformer with input diagram embeddings pre-trained on the icon
dataset. IconQA and Icon645 are available at https://iconqa.github.io.
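
The Patch-TRM baseline described above pairs a pyramid of diagram patch embeddings with a cross-modal Transformer over the question. The PyTorch sketch below illustrates that general design only; it is not the authors' released implementation, and the patch sizes, hidden width, vocabulary size, and answer-space size are placeholder choices.

```python
# Minimal sketch (not the authors' code) of a pyramid patch encoder feeding a
# cross-modal Transformer, in the spirit of the Patch-TRM baseline described
# above. All sizes are illustrative.
import torch
import torch.nn as nn


class PyramidPatchEncoder(nn.Module):
    """Embed a diagram at several patch granularities and concatenate the tokens."""
    def __init__(self, patch_sizes=(32, 16), dim=256):
        super().__init__()
        self.projs = nn.ModuleList(
            [nn.Conv2d(3, dim, kernel_size=p, stride=p) for p in patch_sizes]
        )

    def forward(self, image):                               # image: (B, 3, H, W)
        tokens = []
        for proj in self.projs:
            feat = proj(image)                               # (B, dim, H/p, W/p)
            tokens.append(feat.flatten(2).transpose(1, 2))   # (B, N_p, dim)
        return torch.cat(tokens, dim=1)                      # pyramid of patch tokens


class CrossModalVQA(nn.Module):
    """Question tokens cross-attend over the patch pyramid; predict an answer class."""
    def __init__(self, vocab_size=5000, num_answers=500, dim=256):
        super().__init__()
        self.patch_encoder = PyramidPatchEncoder(dim=dim)
        self.word_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.cross_modal = nn.TransformerDecoder(layer, num_layers=2)
        self.answer_head = nn.Linear(dim, num_answers)

    def forward(self, image, question_ids):
        img_tokens = self.patch_encoder(image)               # (B, N, dim)
        q_tokens = self.word_embed(question_ids)             # (B, T, dim)
        fused = self.cross_modal(tgt=q_tokens, memory=img_tokens)
        return self.answer_head(fused.mean(dim=1))           # (B, num_answers)


model = CrossModalVQA()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 12)))
print(logits.shape)                                          # torch.Size([2, 500])
```

In the actual Patch-TRM, the patch encoder is pre-trained on Icon645 before being used for IconQA; here it is randomly initialized for brevity.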
Related papers
- SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset [11.729464930866483]
"SimpsonsVQA" is a novel dataset for VQA derived from The Simpsons TV show.
It is designed not only to address the traditional VQA task but also to identify questions that are irrelevant to the accompanying image.
SimpsonsVQA contains approximately 23K images, 166K QA pairs, and 500K judgments.
arXiv Detail & Related papers (2024-10-30T02:30:40Z)
- Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts? [62.984473889987605]
We present a zero-shot framework for fine-grained visual concept learning that leverages a large language model (LLM) and a Visual Question Answering (VQA) system.
We pose the generated questions, along with the query image, to the VQA system and aggregate the answers to determine the presence or absence of an object in the test images.
Our experiments demonstrate comparable performance with existing zero-shot visual classification methods and few-shot concept learning approaches.
arXiv Detail & Related papers (2024-10-17T15:16:10Z)
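
The entry above describes an ask-then-aggregate pipeline: attribute questions are posed to a VQA system and the answers are pooled into a presence/absence decision. The snippet below is a toy illustration of that aggregation step under a majority-vote assumption; the `fake_vqa` function, the example questions, and the threshold are stand-ins, not the paper's actual LLM or VQA components.

```python
# Illustrative sketch (not the paper's code) of pooling per-question yes/no
# VQA answers into a presence decision for a visual concept.
from typing import Callable, List


def concept_present(image_path: str,
                    questions: List[str],
                    vqa_answer: Callable[[str, str], str],
                    threshold: float = 0.5) -> bool:
    """Majority-vote aggregation of per-question yes/no VQA answers."""
    votes = [vqa_answer(image_path, q).strip().lower() == "yes" for q in questions]
    return sum(votes) / max(len(votes), 1) >= threshold


# Toy stand-in for a real VQA system, for demonstration only.
def fake_vqa(image_path: str, question: str) -> str:
    return "yes" if "stripes" in question or "four legs" in question else "no"


questions = [
    "Does the animal have stripes?",
    "Does the animal have four legs?",
    "Does the animal have feathers?",
]
print(concept_present("zebra.jpg", questions, fake_vqa))  # True (2 of 3 votes)
```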
- COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes [14.603382370403]
We formulate visual lateral thinking as a multiple-choice question-answering task.
We describe a three-step taxonomy-driven methodology for instantiating task examples.
We develop COLUMBUS, a synthetic benchmark that applies the task pipeline to create QA sets with text and icon rebus puzzles.
arXiv Detail & Related papers (2024-09-06T06:49:55Z)
- Cross-Modal Contrastive Learning for Robust Reasoning in VQA [76.1596796687494]
Multi-modal reasoning in visual question answering (VQA) has witnessed rapid progress recently.
Most reasoning models heavily rely on shortcuts learned from training data.
We propose a simple but effective cross-modal contrastive learning strategy to discourage such shortcut reasoning.
arXiv Detail & Related papers (2022-11-21T05:32:24Z)
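
As a rough illustration of the cross-modal contrastive strategy mentioned in the entry above, the sketch below implements a generic symmetric InfoNCE loss over paired image and text embeddings; the paper's exact construction of positives and negatives is not reproduced here, and the temperature and embedding sizes are arbitrary.

```python
# Generic symmetric InfoNCE-style contrastive loss between image and text
# embeddings: matching pairs are pulled together, other in-batch pairs apart.
import torch
import torch.nn.functional as F


def contrastive_loss(img_emb: torch.Tensor,
                     txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.size(0))           # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```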
- A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
arXiv Detail & Related papers (2022-06-03T17:52:27Z)
- VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
arXiv Detail & Related papers (2022-05-23T17:55:34Z)
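
The VQA-GNN entry above hinges on joining a scene graph and a concept graph through a super node that represents the QA context. The networkx sketch below shows only that graph-construction step with made-up node names; the paper's GNN encoders and bidirectional message passing are not shown.

```python
# Minimal sketch of linking a scene graph and a concept graph through a
# single "QA-context" super node. Node names are illustrative.
import networkx as nx

scene_nodes = ["person", "bat", "ball"]                  # from a scene graph
concept_nodes = ["baseball", "sport", "swing"]           # from a concept KG

g = nx.Graph()
g.add_nodes_from(scene_nodes, source="scene")
g.add_nodes_from(concept_nodes, source="concept")
g.add_edges_from([("person", "bat"), ("person", "ball")])        # visual relations
g.add_edges_from([("baseball", "sport"), ("baseball", "swing")])  # KG relations

# The super node ties both sub-graphs to the question-answer context so that
# message passing can fuse unstructured and structured knowledge.
g.add_node("qa_context", source="super")
for n in scene_nodes + concept_nodes:
    g.add_edge("qa_context", n)

print(g.number_of_nodes(), g.number_of_edges())  # 7 nodes, 10 edges
```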
- MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding [131.8797942031366]
We present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text.
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
We introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task.
arXiv Detail & Related papers (2021-12-20T18:23:30Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate question-answer pairs from both the Visual Genome scene graph and an external knowledge base, using controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
arXiv Detail & Related papers (2020-08-31T23:25:01Z)
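
The entry above describes reasoning as a sequence of memory-based steps over knowledge graphs. The sketch below is a generic attention-read plus gated-update loop meant only to convey that pattern; it is a simplified stand-in, not the paper's actual Graph-based Read, Update, and Control module, and the dimensions and step count are arbitrary.

```python
# Generic read/update reasoning loop: at each step the query state attends
# over graph-node features (read) and is refreshed by a gated cell (update).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReadUpdateReasoner(nn.Module):
    def __init__(self, dim: int = 128, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.update = nn.GRUCell(dim, dim)

    def forward(self, query: torch.Tensor, nodes: torch.Tensor) -> torch.Tensor:
        # query: (B, dim) question state; nodes: (B, N, dim) graph memories
        state = query
        for _ in range(self.steps):
            attn = F.softmax(torch.einsum("bd,bnd->bn", state, nodes), dim=-1)
            read = torch.einsum("bn,bnd->bd", attn, nodes)   # attention read
            state = self.update(read, state)                 # gated update
        return state


reasoner = ReadUpdateReasoner()
out = reasoner(torch.randn(2, 128), torch.randn(2, 6, 128))
print(out.shape)  # torch.Size([2, 128])
```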
- REXUP: I REason, I EXtract, I UPdate with Structured Compositional Reasoning for Visual Question Answering [4.02726934790798]
We propose a deep reasoning VQA model with explicit visual structure-aware textual information.
The REXUP network consists of two branches, one image-object-oriented and one scene-graph-oriented, which work jointly with a super-diagonal fusion compositional attention network.
Our best model significantly outperforms the previous state-of-the-art, delivering 92.7% on the validation set and 73.1% on the test-dev set.
arXiv Detail & Related papers (2020-07-27T00:54:50Z)