InfographicVQA
- URL: http://arxiv.org/abs/2104.12756v1
- Date: Mon, 26 Apr 2021 17:45:54 GMT
- Title: InfographicVQA
- Authors: Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito, Dimosthenis
Karatzas, Ernest Valveny, C.V. Jawahar
- Abstract summary: InfographicVQA is a new dataset that comprises a diverse collection of infographics along with natural language questions and answer annotations.
We curate the dataset with emphasis on questions that require elementary reasoning and basic arithmetic skills.
The dataset, code and leaderboard will be made available at http://docvqa.org.
- Score: 31.084392784258032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Infographics are documents designed to effectively communicate information
using a combination of textual, graphical and visual elements. In this work, we
explore the automatic understanding of infographic images by using Visual
Question Answering techniques. To this end, we present InfographicVQA, a new
dataset that comprises a diverse collection of infographics along with natural
language questions and answer annotations. The collected questions require
methods to jointly reason over the document layout, textual content, graphical
elements, and data visualizations. We curate the dataset with emphasis on
questions that require elementary reasoning and basic arithmetic skills.
Finally, we evaluate two strong baselines based on state-of-the-art multi-modal
VQA models, and establish baseline performance for the new task. The dataset,
code and leaderboard will be made available at http://docvqa.org.
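The abstract does not say how answers are scored, but extractive document-VQA benchmarks of this kind are typically evaluated with ANLS (Average Normalized Levenshtein Similarity), which gives partial credit for near-miss answer strings. The Python sketch below is a minimal illustration of that kind of scoring under this assumption; it is not the official evaluation code from docvqa.org, and the tau threshold is the conventional ANLS value rather than one quoted from the paper.

```python
# Illustrative ANLS-style scoring sketch (assumption: the benchmark uses ANLS;
# tau = 0.5 is the conventional threshold, not a value taken from the paper).

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def anls(predictions, ground_truths, tau=0.5):
    """predictions: list[str]; ground_truths: list[list[str]] of accepted answers per question."""
    scores = []
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            nl = levenshtein(p, a) / max(len(p), len(a), 1)  # normalized edit distance
            best = max(best, 1.0 - nl if nl < tau else 0.0)  # partial credit below the threshold
        scores.append(best)
    return sum(scores) / max(len(scores), 1)

# A question whose answer is read off a chart; both surface forms count as correct.
print(anls(["42%"], [["42%", "42 percent"]]))  # 1.0
```

Taking the maximum over several accepted answer variants matters for infographics, where answers are often short strings such as percentages or counts that admit minor formatting differences.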
Related papers
- A Comprehensive Survey on Visual Question Answering Datasets and Algorithms [1.941892373913038]
We meticulously analyze the current state of VQA datasets and models, while cleanly dividing them into distinct categories and then summarizing the methodologies and characteristics of each category.
We explore six main paradigms of VQA models: fusion, attention, the technique of using information from one modality to filter information from another, external knowledge base, composition or reasoning, and graph models.
arXiv Detail & Related papers (2024-11-17T18:52:06Z) - Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z) - Enhancing Vision Models for Text-Heavy Content Understanding and Interaction [0.0]
We build a visual chat application integrating CLIP for image encoding and a model from the Massive Text Embedding Benchmark.
The aim of the project is to enhance advanced vision models' capabilities in understanding complex, interconnected visual and textual data.
arXiv Detail & Related papers (2024-05-31T15:17:47Z) - Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in the dataset are more focused on the text present in the image.
The models trained on this dataset predict biased answers due to the lack of understanding of visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z) - TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z) - Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account information from multiple modalities, including text, layout, and the visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z) - SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense
Reasoning [61.57887011165744]
Multimodal Transformers have made great progress on the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z) - Understanding the Role of Scene Graphs in Visual Question Answering [26.02889386248289]
We conduct experiments on the GQA dataset which presents a challenging set of questions requiring counting, compositionality and advanced reasoning capability.
We adopt image + question architectures for use with scene graphs, evaluate various scene graph generation techniques for unseen images, and propose a training curriculum to leverage human-annotated and auto-generated scene graphs.
We present a multi-faceted study into the use of scene graphs for Visual Question Answering, making this work the first of its kind.
arXiv Detail & Related papers (2021-01-14T07:27:37Z) - Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image
Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z) - Extracting Summary Knowledge Graphs from Long Documents [48.92130466606231]
We introduce a new text-to-graph task of predicting summarized knowledge graphs from long documents.
We develop a dataset of 200k document/graph pairs using automatic and human annotations.
arXiv Detail & Related papers (2020-09-19T04:37:33Z) - Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene
Text [93.08109196909763]
We propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN).
It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively.
It then introduces three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities, as sketched after this entry.
arXiv Detail & Related papers (2020-03-31T05:56:59Z)
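The MM-GNN summary above mentions aggregators that pass messages between the visual, semantic, and numeric sub-graphs. The numpy sketch below is a minimal, hypothetical illustration of one cross-modality aggregation step using dot-product attention; it is not the authors' MM-GNN implementation, and all function names, shapes, and parameters are assumed for illustration.

```python
# Hypothetical sketch of a cross-modality "aggregator": nodes of one sub-graph
# attend over nodes of another and mix in their features (not the MM-GNN code).
import numpy as np

def cross_modal_aggregate(target, source, w_q, w_k, w_v):
    """target: (n_t, d) nodes, e.g. visual objects; source: (n_s, d) nodes, e.g. OCR tokens;
    w_q, w_k, w_v: (d, d) projection matrices (illustrative parameters)."""
    q = target @ w_q                               # queries from the target sub-graph
    k = source @ w_k                               # keys from the source sub-graph
    v = source @ w_v                               # values carried across modalities
    logits = q @ k.T / np.sqrt(q.shape[1])
    att = np.exp(logits - logits.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)          # row-wise softmax over source nodes
    return target + att @ v                        # residual update of the target nodes

rng = np.random.default_rng(0)
d = 16
visual = rng.normal(size=(5, d))                   # e.g. detected object features
semantic = rng.normal(size=(7, d))                 # e.g. OCR token features
w_q, w_k, w_v = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
print(cross_modal_aggregate(visual, semantic, w_q, w_k, w_v).shape)  # (5, 16)
```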
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.