RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering
- URL: http://arxiv.org/abs/2010.12917v1
- Date: Sat, 24 Oct 2020 15:37:09 GMT
- Title: RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering
- Authors: Zan-Xia Jin, Heran Wu, Chun Yang, Fang Zhou, Jingyan Qin, Lei Xiao and Xu-Cheng Yin
- Abstract summary: We propose a novel text-centered method called RUArt (Reading, Understanding and Answering the Related Text) for text-based VQA.
We evaluate our RUArt on two text-based VQA benchmarks (ST-VQA and TextVQA) and conduct extensive ablation studies for exploring the reasons behind RUArt's effectiveness.
- Score: 14.498144268367541
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based visual question answering (VQA) requires reading and understanding the text in an image to correctly answer a given question. However, most current methods simply add optical character recognition (OCR) tokens extracted from the image into the VQA model without considering the contextual information of the OCR tokens or mining the relationships between OCR tokens and scene objects. In this paper, we propose a novel text-centered method called RUArt (Reading, Understanding and Answering the Related Text) for text-based VQA. Taking an image and a question as input, RUArt first reads the image and obtains the text and scene objects. Then, it understands the question, the OCRed text and the objects in the context of the scene, and further mines the relationships among them. Finally, it answers the given question through text semantic matching and reasoning over the related text. We evaluate RUArt on two text-based VQA benchmarks (ST-VQA and TextVQA) and conduct extensive ablation studies to explore the reasons behind its effectiveness. Experimental results demonstrate that our method can effectively exploit the contextual information of the text and mine the stable relationships between the text and objects.
Related papers
- Scene-Text Grounding for Text-Based Video Question Answering [97.1112579979614]
Existing efforts in text-based video question answering (TextVideoQA) are criticized for their opaque decision-making and reliance on scene-text recognition.
We study Grounded TextVideoQA by forcing models to answer questions and interpret relevant scene-text regions.
arXiv Detail & Related papers (2024-09-22T05:13:11Z)
- Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering [50.52792174648067]
This initiative seeks to bridge the gap between textual and visual comprehension.
We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images.
We provide fine-grained annotations for text instances, addressing the limitations of previous datasets.
arXiv Detail & Related papers (2024-05-21T06:48:26Z)
- Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in the dataset are more focused on the text present in the image.
The models trained on this dataset predict biased answers due to the lack of understanding of visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z)
- Look, Read and Ask: Learning to Ask Questions by Reading Text in Images [3.3972119795940525]
We present the novel problem of text-based visual question generation, or TextVQG.
To address TextVQG, we present an OCR-consistent visual question generation model that Looks into the visual content, Reads the scene text, and Asks a relevant and meaningful natural language question.
arXiv Detail & Related papers (2022-11-23T13:52:46Z)
- Text-Aware Dual Routing Network for Visual Question Answering [11.015339851906287]
Existing approaches often fail in cases that require reading and understanding text in images to answer questions.
We propose a Text-Aware Dual Routing Network (TDR) which simultaneously handles VQA cases that do and do not require understanding the text in the input images.
In the branch that involves text understanding, we incorporate the Optical Character Recognition (OCR) features into the model to help understand the text in the images.
arXiv Detail & Related papers (2022-11-17T02:02:11Z)
- TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- CORE-Text: Improving Scene Text Detection with Contrastive Relational Reasoning [65.57338873921168]
Localizing text instances in natural scenes is regarded as a fundamental challenge in computer vision.
In this work, we quantitatively analyze the sub-text problem and present a simple yet effective design, COntrastive RElation (CORE) module.
We integrate the CORE module into a two-stage text detector of Mask R-CNN and devise our text detector CORE-Text.
arXiv Detail & Related papers (2021-12-14T16:22:25Z)
- Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling [12.233796960280944]
Text-VQA (Visual Question Answering) aims at question answering through reading text information in images.
LOGOS is a novel model which attempts to tackle this problem from multiple aspects.
arXiv Detail & Related papers (2021-08-20T01:31:51Z)
- Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering [8.81824569181583]
This paper proposes a localization-aware answer prediction network (LaAP-Net) to address this challenge.
Our LaAP-Net not only generates the answer to the question but also predicts a bounding box as evidence of the generated answer.
Our proposed LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin.
arXiv Detail & Related papers (2020-10-06T09:46:20Z)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.