IconQA: A New Benchmark for Abstract Diagram Understanding and Visual
Language Reasoning
- URL: http://arxiv.org/abs/2110.13214v1
- Date: Mon, 25 Oct 2021 18:52:26 GMT
- Title: IconQA: A New Benchmark for Abstract Diagram Understanding and Visual
Language Reasoning
- Authors: Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou
Yu, Xiaodan Liang, Song-Chun Zhu
- Abstract summary: We introduce a new challenge of Icon Question Answering (IconQA) with the goal of answering a question in an icon image context.
We release IconQA, a large-scale dataset that consists of 107,439 questions and three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank.
We further release an icon dataset Icon645 which contains 645,687 colored icons on 377 classes.
- Score: 132.49090098391258
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current visual question answering (VQA) tasks mainly consider answering
human-annotated questions for natural images. However, aside from natural
images, abstract diagrams with semantic richness are still understudied in
visual understanding and reasoning research. In this work, we introduce a new
challenge of Icon Question Answering (IconQA) with the goal of answering a
question in an icon image context. We release IconQA, a large-scale dataset
that consists of 107,439 questions and three sub-tasks: multi-image-choice,
multi-text-choice, and filling-in-the-blank. The IconQA dataset is inspired by
real-world diagram word problems that highlight the importance of abstract
diagram understanding and comprehensive cognitive reasoning. Thus, IconQA
requires not only perception skills like object recognition and text
understanding, but also diverse cognitive reasoning skills, such as geometric
reasoning, commonsense reasoning, and arithmetic reasoning. To facilitate
potential IconQA models to learn semantic representations for icon images, we
further release an icon dataset Icon645 which contains 645,687 colored icons on
377 classes. We conduct extensive user studies and blind experiments and
reproduce a wide range of advanced VQA methods to benchmark the IconQA task.
Also, we develop a strong IconQA baseline Patch-TRM that applies a pyramid
cross-modal Transformer with input diagram embeddings pre-trained on the icon
dataset. IconQA and Icon645 are available at https://iconqa.github.io.
Related papers
- Language Guided Visual Question Answering: Elevate Your Multimodal
Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z) - A Symbolic Character-Aware Model for Solving Geometry Problems [18.68829580108664]
In the text description, symbolic characters such as "$triangle$ABC" often serve as a bridge to connect the corresponding diagram.
We develop a symbolic character-aware model to fully explore the role of these characters in both text and diagram understanding.
arXiv Detail & Related papers (2023-08-05T08:56:55Z) - Cross-Modal Contrastive Learning for Robust Reasoning in VQA [76.1596796687494]
Multi-modal reasoning in visual question answering (VQA) has witnessed rapid progress recently.
Most reasoning models heavily rely on shortcuts learned from training data.
We propose a simple but effective cross-modal contrastive learning strategy to get rid of the shortcut reasoning.
arXiv Detail & Related papers (2022-11-21T05:32:24Z) - ChiQA: A Large Scale Image-based Real-World Question Answering Dataset
for Multi-Modal Understanding [42.5118058527339]
ChiQA contains more than 40K questions and more than 200K question-images pairs.
ChiQA requires a deep understanding of both language and vision, including grounding, comparisons, and reading.
We evaluate several state-of-the-art visual-language models such as ALBEF, demonstrating that there is still a large room for improvements on ChiQA.
arXiv Detail & Related papers (2022-08-05T07:55:28Z) - A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
arXiv Detail & Related papers (2022-06-03T17:52:27Z) - VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks
for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
arXiv Detail & Related papers (2022-05-23T17:55:34Z) - MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media
Knowledge Extraction and Grounding [131.8797942031366]
We present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text.
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
We introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task.
arXiv Detail & Related papers (2021-12-20T18:23:30Z) - Knowledge-Routed Visual Question Reasoning: Challenges for Deep
Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z) - Cross-modal Knowledge Reasoning for Knowledge-based Visual Question
Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a G raph-based R ead, U pdate, and C ontrol.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
arXiv Detail & Related papers (2020-08-31T23:25:01Z) - REXUP: I REason, I EXtract, I UPdate with Structured Compositional
Reasoning for Visual Question Answering [4.02726934790798]
We propose a deep reasoning VQA model with explicit visual structure-aware textual information.
REXUP network consists of two branches, image object-oriented and scene graph oriented, which jointly works with super-diagonal fusion compositional attention network.
Our best model significantly outperforms the precious state-of-the-art, which delivers 92.7% on the validation set and 73.1% on the test-dev set.
arXiv Detail & Related papers (2020-07-27T00:54:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.