Question-controlled Text-aware Image Captioning
- URL: http://arxiv.org/abs/2108.02059v1
- Date: Wed, 4 Aug 2021 13:34:54 GMT
- Title: Question-controlled Text-aware Image Captioning
- Authors: Anwen Hu, Shizhe Chen, Qin Jin
- Abstract summary: Question-controlled Text-aware Image Captioning (Qc-TextCap) is a new challenging task.
With questions as control signals, our model generates more informative and diverse captions than the state-of-the-art text-aware captioning model.
GQAM generates a personalized text-aware caption with a Multimodal Decoder.
- Score: 41.53906032024941
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For an image with multiple scene texts, different people may be interested in
different text information. Current text-aware image captioning models are not
able to generate distinctive captions according to various information needs.
To explore how to generate personalized text-aware captions, we define a new
challenging task, namely Question-controlled Text-aware Image Captioning
(Qc-TextCap). With questions as control signals, this task requires models to
understand questions, find related scene texts and describe them together with
objects fluently in human language. Based on two existing text-aware captioning
datasets, we automatically construct two datasets, ControlTextCaps and
ControlVizWiz to support the task. We propose a novel Geometry and Question
Aware Model (GQAM). GQAM first applies a Geometry-informed Visual Encoder to
fuse region-level object features and region-level scene text features while
considering spatial relationships. Then, we design a Question-guided Encoder to
select the most relevant visual features for each question. Finally, GQAM
generates a personalized text-aware caption with a Multimodal Decoder. Our
model achieves better captioning performance and question answering ability
than carefully designed baselines on both datasets. With questions as
control signals, our model generates more informative and diverse captions than
the state-of-the-art text-aware captioning model. Our code and datasets are
publicly available at https://github.com/HAWLYQ/Qc-TextCap.
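The abstract describes a three-stage pipeline (Geometry-informed Visual Encoder, Question-guided Encoder, Multimodal Decoder). Below is a minimal, hypothetical PyTorch sketch of how such a pipeline could be wired together; module interfaces, dimensions, and the way bounding-box geometry is injected are illustrative assumptions, not the released GQAM implementation (see the repository above for the actual code).

```python
# Hypothetical sketch of the GQAM-style three-stage pipeline; all names and
# dimensions are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class GeometryInformedVisualEncoder(nn.Module):
    """Fuses object-region and scene-text-region features, conditioning
    self-attention on bounding-box geometry."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.box_proj = nn.Linear(4, d_model)           # embed [x1, y1, x2, y2]
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, obj_feats, obj_boxes, txt_feats, txt_boxes):
        feats = torch.cat([obj_feats, txt_feats], dim=1)
        boxes = torch.cat([obj_boxes, txt_boxes], dim=1)
        geo = feats + self.box_proj(boxes)               # inject spatial information
        fused, _ = self.attn(geo, geo, geo)              # region-to-region fusion
        return fused


class QuestionGuidedEncoder(nn.Module):
    """Selects the visual features most relevant to the control question."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, question_emb, visual_feats):
        # question tokens attend over the fused visual regions
        selected, _ = self.cross_attn(question_emb, visual_feats, visual_feats)
        return selected


class MultimodalDecoder(nn.Module):
    """Generates the caption conditioned on question-selected features."""
    def __init__(self, vocab_size=10000, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, caption_tokens, memory):
        tgt = self.embed(caption_tokens)
        return self.out(self.decoder(tgt, memory))       # (batch, steps, vocab)


if __name__ == "__main__":
    B, n_obj, n_txt, q_len, t_len, d = 2, 5, 3, 6, 7, 512
    enc, qenc, dec = (GeometryInformedVisualEncoder(d),
                      QuestionGuidedEncoder(d),
                      MultimodalDecoder(d_model=d))
    fused = enc(torch.randn(B, n_obj, d), torch.rand(B, n_obj, 4),
                torch.randn(B, n_txt, d), torch.rand(B, n_txt, 4))
    selected = qenc(torch.randn(B, q_len, d), fused)     # placeholder question embeddings
    logits = dec(torch.randint(0, 10000, (B, t_len)), selected)
    print(logits.shape)                                   # torch.Size([2, 7, 10000])
```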
Related papers
- Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering [50.52792174648067]
This initiative seeks to bridge the gap between textual and visual comprehension.
We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images.
We provide fine-grained annotations for text instances, addressing the limitations of previous datasets.
arXiv Detail & Related papers (2024-05-21T06:48:26Z) - Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in the dataset are more focused on the text present in the image.
The models trained on this dataset predict biased answers due to the lack of understanding of visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z) - Locate Then Generate: Bridging Vision and Language with Bounding Box for
Scene-Text VQA [15.74007067413724]
We propose a novel framework for Scene Text Visual Question Answering (STVQA),
a task that requires models to read scene text in images for question answering.
arXiv Detail & Related papers (2023-04-04T07:46:40Z) - Language Quantized AutoEncoders: Towards Unsupervised Text-Image
Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z) - Character-Centric Story Visualization via Visual Planning and Token
Alignment [53.44760407148918]
Story visualization advances traditional text-to-image generation by enabling the generation of multiple images based on a complete story.
A key challenge of consistent story visualization is to preserve the characters that are essential to the story.
We propose to adapt a recent work that augments Vector-Quantized Variational Autoencoders with a text-to-visual-token architecture.
arXiv Detail & Related papers (2022-10-16T06:50:39Z) - TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z) - Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling [12.233796960280944]
Text-VQA (Visual Question Answering) aims to answer questions by reading the text information in images.
LOGOS is a novel model which attempts to tackle this problem from multiple aspects.
arXiv Detail & Related papers (2021-08-20T01:31:51Z) - Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language
Representation Learning with Pre-trained Sequence-to-Sequence Model [18.848107244522666]
TextVQA requires models to read and reason about text in images to answer questions about them.
In this challenge, we use the generative model T5 for the TextVQA task.
arXiv Detail & Related papers (2021-06-24T06:39:37Z) - TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understanding our surroundings.
To study how to comprehend text in the context of an image, we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)