Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering
- URL: http://arxiv.org/abs/2010.02582v1
- Date: Tue, 6 Oct 2020 09:46:20 GMT
- Title: Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering
- Authors: Wei Han and Hantao Huang and Tao Han
- Abstract summary: This paper proposes a localization-aware answer prediction network (LaAP-Net) to address the underused positional information of text and the lack of evidence for generated answers in text VQA.
Our LaAP-Net not only generates the answer to the question but also predicts a bounding box as evidence of the generated answer.
Our proposed LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin.
- Score: 8.81824569181583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image text carries essential information to understand the scene and perform
reasoning. Text-based visual question answering (text VQA) task focuses on
visual questions that require reading text in images. Existing text VQA systems
generate an answer by selecting from optical character recognition (OCR) texts
or a fixed vocabulary. Positional information of text is underused and there is
a lack of evidence for the generated answer. As such, this paper proposes a
localization-aware answer prediction network (LaAP-Net) to address this
challenge. Our LaAP-Net not only generates the answer to the question but also
predicts a bounding box as evidence of the generated answer. Moreover, a
context-enriched OCR representation (COR) for multimodal fusion is proposed to
facilitate the localization task. Our proposed LaAP-Net outperforms existing
approaches on three benchmark datasets for the text VQA task by a noticeable
margin.
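
To make the core idea concrete, below is a minimal sketch (not the authors' released LaAP-Net code) of an answer head paired with a bounding-box "evidence" head on top of fused multimodal features; the module name, feature dimension, and vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AnswerWithEvidenceHead(nn.Module):
    """Toy head that jointly predicts an answer and a bounding box as evidence."""

    def __init__(self, fused_dim: int = 768, vocab_size: int = 5000):
        super().__init__()
        # Scores over a fixed vocabulary (a full system would also score OCR tokens).
        self.answer_head = nn.Linear(fused_dim, vocab_size)
        # Regresses a normalized (x, y, w, h) box locating the answer evidence.
        self.bbox_head = nn.Sequential(
            nn.Linear(fused_dim, fused_dim // 2),
            nn.ReLU(),
            nn.Linear(fused_dim // 2, 4),
            nn.Sigmoid(),
        )

    def forward(self, fused_features: torch.Tensor):
        # fused_features: (batch, fused_dim), a multimodal fusion of the
        # question, visual objects, and context-enriched OCR representations.
        answer_logits = self.answer_head(fused_features)
        evidence_bbox = self.bbox_head(fused_features)
        return answer_logits, evidence_bbox

if __name__ == "__main__":
    feats = torch.randn(2, 768)              # placeholder fused features
    logits, bbox = AnswerWithEvidenceHead()(feats)
    print(logits.shape, bbox.shape)          # (2, 5000) and (2, 4)
```

The paired heads capture the paper's stated goal: the answer prediction is accompanied by a localized bounding box that serves as evidence, rather than an answer alone.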
Related papers
- See then Tell: Enhancing Key Information Extraction with Vision Grounding [54.061203106565706]
We introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding.
To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets.
arXiv Detail & Related papers (2024-09-29T06:21:05Z) - Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering [50.52792174648067]
This initiative seeks to bridge the gap between textual and visual comprehension.
We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images.
We provide fine-grained annotations for text instances, addressing the limitations of previous datasets.
arXiv Detail & Related papers (2024-05-21T06:48:26Z) - TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - Look, Read and Ask: Learning to Ask Questions by Reading Text in Images [3.3972119795940525]
We present a novel problem of text-based visual question generation or TextVQG.
To address TextVQG, we present an OCR consistent visual question generation model that Looks into the visual content, Reads the scene text, and Asks a relevant and meaningful natural language question.
arXiv Detail & Related papers (2022-11-23T13:52:46Z) - Text-Aware Dual Routing Network for Visual Question Answering [11.015339851906287]
Existing approaches often fail in cases that require reading and understanding text in images to answer questions.
We propose a Text-Aware Dual Routing Network (TDR) which simultaneously handles the VQA cases with and without understanding text information in the input images.
In the branch that involves text understanding, we incorporate the Optical Character Recognition (OCR) features into the model to help understand the text in the images.
arXiv Detail & Related papers (2022-11-17T02:02:11Z) - TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z) - Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling [12.233796960280944]
Text-VQA (Visual Question Answering) aims at question answering through reading text information in images.
LOGOS is a novel model which attempts to tackle this problem from multiple aspects.
arXiv Detail & Related papers (2021-08-20T01:31:51Z) - CapWAP: Captioning with a Purpose [56.99405135645775]
We propose a new task, Captioning with a Purpose (CapWAP).
Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population.
We show that it is possible to use reinforcement learning to directly optimize for the intended information need.
arXiv Detail & Related papers (2020-11-09T09:23:55Z) - RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering [14.498144268367541]
We propose a novel text-centered method called RUArt (Reading, Understanding and Answering the Related Text) for text-based VQA.
We evaluate our RUArt on two text-based VQA benchmarks (ST-VQA and TextVQA) and conduct extensive ablation studies for exploring the reasons behind RUArt's effectiveness.
arXiv Detail & Related papers (2020-10-24T15:37:09Z) - Multimodal grid features and cell pointers for Scene Text Visual Question Answering [7.834170106487722]
This paper presents a new model for the task of scene text visual question answering.
It is based on an attention mechanism that attends to multi-modal features conditioned on the question.
Experiments demonstrate competitive performance on two standard datasets.
arXiv Detail & Related papers (2020-06-01T13:17:44Z)