Look, Read and Ask: Learning to Ask Questions by Reading Text in Images
- URL: http://arxiv.org/abs/2211.12950v1
- Date: Wed, 23 Nov 2022 13:52:46 GMT
- Title: Look, Read and Ask: Learning to Ask Questions by Reading Text in Images
- Authors: Soumya Jahagirdar, Shankar Gangisetty, Anand Mishra
- Abstract summary: We present a novel problem of text-based visual question generation or TextVQG.
To address TextVQG, we present an OCR consistent visual question generation model that Looks into the visual content, Reads the scene text, and Asks a relevant and meaningful natural language question.
- Score: 3.3972119795940525
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel problem of text-based visual question generation, or TextVQG in short. Given the document image analysis community's growing interest in combining text understanding with conversational artificial intelligence, e.g., text-based visual question answering, TextVQG has become an important task. TextVQG aims to generate a natural language question for a given input image and an automatically extracted piece of text, also known as an OCR token, such that the OCR token is the answer to the generated question. TextVQG is an essential ability for a conversational agent. However, it is challenging as it requires an in-depth understanding of the scene and the ability to semantically bridge the visual content with the text present in the image. To address TextVQG, we present an OCR-consistent visual question generation model that Looks into the visual content, Reads the scene text, and Asks a relevant and meaningful natural language question. We refer to our proposed model as OLRA. We perform an extensive evaluation of OLRA on two public benchmarks and compare it against baselines. OLRA automatically generates questions similar to those in manually curated public text-based visual question answering datasets. Moreover, we significantly outperform baseline approaches on performance measures commonly used in the text generation literature.
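The abstract describes the task only at a high level: given an image and an OCR token extracted from it, generate a question whose answer is that token, by looking at the visual content, reading the scene text, and asking. As a rough, illustrative sketch only (the paper's actual OLRA architecture, feature extractors, and dimensions are not given here, so the module names, dimensions, and the choice of a GRU decoder below are all assumptions), one way such a look-read-ask pipeline could be wired up in PyTorch is:

```python
# Illustrative sketch of a TextVQG-style model (NOT the official OLRA code).
# Assumptions: precomputed pooled image features, a pretrained OCR token
# embedding, and a GRU decoder that generates the question token by token.
import torch
import torch.nn as nn


class TextVQGSketch(nn.Module):
    def __init__(self, vocab_size, img_feat_dim=2048, ocr_emb_dim=300, hidden_dim=512):
        super().__init__()
        # "Look": project pooled visual features into a joint space.
        self.visual_proj = nn.Linear(img_feat_dim, hidden_dim)
        # "Read": project the OCR token embedding into the same space.
        self.ocr_proj = nn.Linear(ocr_emb_dim, hidden_dim)
        # "Ask": an autoregressive GRU decoder over the question vocabulary.
        self.word_emb = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, ocr_emb, question_in):
        # Fuse visual and OCR evidence into the decoder's initial hidden state.
        fused = torch.tanh(self.visual_proj(img_feats) + self.ocr_proj(ocr_emb))
        h0 = fused.unsqueeze(0)                     # (1, batch, hidden_dim)
        emb = self.word_emb(question_in)            # (batch, seq_len, hidden_dim)
        dec_out, _ = self.decoder(emb, h0)
        return self.out(dec_out)                    # per-step vocabulary logits


if __name__ == "__main__":
    model = TextVQGSketch(vocab_size=1000)
    img = torch.randn(2, 2048)              # pooled image features (assumed precomputed)
    ocr = torch.randn(2, 300)               # embedding of the OCR token that must be the answer
    q_in = torch.randint(0, 1000, (2, 12))  # teacher-forced question tokens
    logits = model(img, ocr, q_in)
    print(logits.shape)                     # torch.Size([2, 12, 1000])
```

The point the sketch illustrates is the fusion step: the question decoder is conditioned jointly on the visual features and on the embedding of the OCR token that must serve as the answer, which is what makes the generated question OCR-consistent in spirit.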
Related papers
- See then Tell: Enhancing Key Information Extraction with Vision Grounding [54.061203106565706]
We introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding.
To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets.
arXiv Detail & Related papers (2024-09-29T06:21:05Z)
- Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering [50.52792174648067]
This initiative seeks to bridge the gap between textual and visual comprehension.
We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images.
We provide fine-grained annotations for text instances, addressing the limitations of previous datasets.
arXiv Detail & Related papers (2024-05-21T06:48:26Z)
- Evaluating Text-to-Visual Generation with Image-to-Text Generation [113.07368313330994]
VQAScore uses a visual-question-answering (VQA) model to produce an alignment score.
It produces state-of-the-art results across eight image-text alignment benchmarks.
We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts.
arXiv Detail & Related papers (2024-04-01T17:58:06Z)
- Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in the dataset are focused more on the text present in the image than on the visual content.
Models trained on this dataset therefore predict biased answers due to a lack of visual-context understanding.
arXiv Detail & Related papers (2023-08-01T05:28:13Z)
- TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering [86.38098280689027]
We introduce an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA).
We present a comprehensive evaluation of existing text-to-image models using a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.).
arXiv Detail & Related papers (2023-03-21T14:41:02Z)
- Text-Aware Dual Routing Network for Visual Question Answering [11.015339851906287]
Existing approaches often fail in cases that require reading and understanding text in images to answer questions.
We propose a Text-Aware Dual Routing Network (TDR) which simultaneously handles the VQA cases with and without understanding text information in the input images.
In the branch that involves text understanding, we incorporate Optical Character Recognition (OCR) features into the model to help it understand the text in the images.
arXiv Detail & Related papers (2022-11-17T02:02:11Z)
- TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z)
- Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling [12.233796960280944]
Text-VQA (Visual Question Answering) aims at question answering through reading text information in images.
LOGOS is a novel model which attempts to tackle this problem from multiple aspects.
arXiv Detail & Related papers (2021-08-20T01:31:51Z)
- RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering [14.498144268367541]
We propose a novel text-centered method called RUArt (Reading, Understanding and Answering the Related Text) for text-based VQA.
We evaluate our RUArt on two text-based VQA benchmarks (ST-VQA and TextVQA) and conduct extensive ablation studies for exploring the reasons behind RUArt's effectiveness.
arXiv Detail & Related papers (2020-10-24T15:37:09Z)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)