Making the V in Text-VQA Matter
- URL: http://arxiv.org/abs/2308.00295v1
- Date: Tue, 1 Aug 2023 05:28:13 GMT
- Title: Making the V in Text-VQA Matter
- Authors: Shamanthak Hegde, Soumya Jahagirdar and Shankar Gangisetty
- Abstract summary: Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in the dataset are more focused on the text present in the image.
The models trained on this dataset predict biased answers due to the lack of understanding of visual context.
- Score: 1.2962828085662563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based VQA aims at answering questions by reading the text present in the
images. It requires far more understanding of scene-text relationships than
the general VQA task. Recent studies have shown that the question-answer
pairs in the dataset are more focused on the text present in the image but less
importance is given to visual features and some questions do not require
understanding the image. The models trained on this dataset predict biased
answers due to the lack of understanding of visual context. For example, for
questions like "What is written on the signboard?", the model always predicts
"STOP", which means the model ignores the image. To address
these issues, we propose a method to learn visual features (making V matter in
TextVQA) along with the OCR features and question features, using the VQA dataset as
external knowledge for Text-based VQA. Specifically, we combine the TextVQA
dataset and VQA dataset and train the model on this combined dataset. Such a
simple, yet effective approach increases the understanding and correlation
between the image features and the text present in the image, which helps the
model answer questions better. We further test the model on different datasets
and compare their qualitative and quantitative results.
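The dataset-combination step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the field names (`image_id`, `question`, `answers`, `ocr_tokens`, `source`), the helper names, and the toy samples are all hypothetical, and the choice to give VQA samples an empty OCR token list is an assumption about how the two datasets might share one schema.

```python
import random
from typing import Dict, List

Sample = Dict[str, object]


def to_common_schema(sample: Sample, source: str) -> Sample:
    """Map a raw sample to a shared format (hypothetical schema).

    Samples without OCR tokens (assumed here for plain VQA data)
    receive an empty token list.
    """
    return {
        "image_id": sample["image_id"],
        "question": sample["question"],
        "answers": list(sample["answers"]),
        "ocr_tokens": list(sample.get("ocr_tokens", [])),
        "source": source,
    }


def combine_datasets(
    textvqa: List[Sample], vqa: List[Sample], seed: int = 0
) -> List[Sample]:
    """Concatenate both datasets and shuffle them so training batches mix sources."""
    combined = [to_common_schema(s, "textvqa") for s in textvqa]
    combined += [to_common_schema(s, "vqa") for s in vqa]
    random.Random(seed).shuffle(combined)  # seeded for reproducibility
    return combined


# Toy samples, invented purely for illustration.
textvqa_samples = [
    {"image_id": "t1", "question": "What is written on the signboard?",
     "answers": ["stop"], "ocr_tokens": ["STOP"]},
    {"image_id": "t2", "question": "What brand is the bottle?",
     "answers": ["cola"], "ocr_tokens": ["cola", "zero"]},
]
vqa_samples = [
    {"image_id": "v1", "question": "What color is the bus?",
     "answers": ["red"]},
]

train_set = combine_datasets(textvqa_samples, vqa_samples)
```

A model trained on `train_set` would see both text-centric and vision-centric questions; the `source` field is kept only so that per-dataset results can be reported separately.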
Related papers
- A Comprehensive Survey on Visual Question Answering Datasets and Algorithms [1.941892373913038]
We meticulously analyze the current state of VQA datasets and models, while cleanly dividing them into distinct categories and then summarizing the methodologies and characteristics of each category.
We explore six main paradigms of VQA models: fusion, attention, the technique of using information from one modality to filter information from another, external knowledge base, composition or reasoning, and graph models.
arXiv Detail & Related papers (2024-11-17T18:52:06Z)
- ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images [1.2529442734851663]
We introduce the first large-scale dataset in Vietnamese specializing in the ability to understand text appearing in images.
We uncover the significance of the order in which tokens in OCR text are processed and selected to formulate answers.
arXiv Detail & Related papers (2024-04-16T15:28:30Z)
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Look, Read and Ask: Learning to Ask Questions by Reading Text in Images [3.3972119795940525]
We present a novel problem of text-based visual question generation or TextVQG.
To address TextVQG, we present an OCR consistent visual question generation model that Looks into the visual content, Reads the scene text, and Asks a relevant and meaningful natural language question.
arXiv Detail & Related papers (2022-11-23T13:52:46Z)
- TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z)
- Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
- Grounding Answers for Visual Questions Asked by Visually Impaired People [16.978747012406266]
VizWiz-VQA-Grounding is the first dataset that visually grounds answers to visual questions asked by people with visual impairments.
We analyze our dataset and compare it with five VQA-Grounding datasets to demonstrate what makes it similar and different.
arXiv Detail & Related papers (2022-02-04T06:47:16Z)
- Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for the Visual Question Answering (VQA) task.
First, it builds a graph not only for the image but also for the question, in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships by a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
arXiv Detail & Related papers (2021-12-14T10:01:26Z)
- Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network to mainly solve the first two issues above.
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it.
Our proposed model outperforms the SoTA models on the TextVQA dataset and on two tasks of the ST-VQA dataset, among all models except the pre-training-based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z)