TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation
- URL: http://arxiv.org/abs/2208.01813v1
- Date: Wed, 3 Aug 2022 02:18:09 GMT
- Title: TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation
- Authors: Jun Wang, Mingfei Gao, Yuqian Hu, Ramprasaath R. Selvaraju, Chetan
Ramaiah, Ran Xu, Joseph F. JaJa, Larry S. Davis
- Abstract summary: Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
- Score: 55.83319599681002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-VQA aims at answering questions that require understanding the textual
cues in an image. Despite the great progress of existing Text-VQA methods,
their performance suffers from insufficient human-labeled question-answer (QA)
pairs. However, we observe that, in general, the scene text is not fully
exploited in the existing datasets -- only a small portion of text in each
image participates in the annotated QA activities. This results in a huge waste
of useful information. To address this deficiency, we develop a new method to
generate high-quality and diverse QA pairs by explicitly utilizing the existing
rich text available in the scene context of each image. Specifically, we
propose TAG, a text-aware visual question-answer generation architecture that
learns to produce meaningful and accurate QA samples using a multimodal
transformer. The architecture exploits underexplored scene text information and
enhances scene understanding of Text-VQA models by combining the generated QA
pairs with the initial training data. Extensive experimental results on two
well-known Text-VQA benchmarks (TextVQA and ST-VQA) demonstrate that our
proposed TAG effectively enlarges the training data, which helps improve
Text-VQA performance without extra labeling effort. Moreover, our model
outperforms state-of-the-art approaches that are pre-trained with extra
large-scale data. Code will be made publicly available.
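To make the augmentation idea above concrete, here is a minimal sketch, assuming a hypothetical generate_qa stand-in for the paper's multimodal-transformer generator (all names are placeholders, not the authors' released code): OCR tokens that never appear in the human-labeled answers are routed to the generator, and the resulting pseudo QA pairs are appended to the original training annotations.

```python
# Minimal sketch of TAG-style training-data augmentation. `generate_qa` is a
# placeholder for the multimodal-transformer generator; the real model
# conditions on visual features and scene-text embeddings.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class QAPair:
    question: str
    answer: str

def generate_qa(unused_ocr_tokens: List[str]) -> List[QAPair]:
    """Stand-in generator: emits one templated question per unused OCR token.
    In the paper this is a learned model producing diverse questions."""
    return [QAPair(question=f"What text is written near '{tok}' in the image?",
                   answer=tok)
            for tok in unused_ocr_tokens]

def augment_sample(sample: Dict) -> Dict:
    """Append pseudo QA pairs built from scene text that the human
    annotations never used, enlarging the training data at no labeling cost."""
    answered = {qa.answer.lower() for qa in sample["qa_pairs"]}
    unused = [t for t in sample["ocr_tokens"] if t.lower() not in answered]
    # A real pipeline would also filter generated pairs by a quality score.
    sample["qa_pairs"] = sample["qa_pairs"] + generate_qa(unused)
    return sample

if __name__ == "__main__":
    sample = {
        "ocr_tokens": ["STOP", "MAIN", "ST"],
        "qa_pairs": [QAPair("What does the red sign say?", "STOP")],
    }
    print(len(augment_sample(sample)["qa_pairs"]))  # 1 original + 2 generated
```

The essential point the sketch tries to capture is that the generated pairs come from scene text left unused by the human annotations, which is exactly the underexploited signal the abstract describes.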
Related papers
- MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering [58.92057773071854]
We introduce MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages.
arXiv Detail & Related papers (2024-05-20T12:35:01Z)
- Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in the dataset are more focused on the text present in the image.
The models trained on this dataset predict biased answers due to the lack of understanding of visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- Look, Read and Ask: Learning to Ask Questions by Reading Text in Images [3.3972119795940525]
We present a novel problem of text-based visual question generation or TextVQG.
To address TextVQG, we present an OCR consistent visual question generation model that Looks into the visual content, Reads the scene text, and Asks a relevant and meaningful natural language question.
arXiv Detail & Related papers (2022-11-23T13:52:46Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling [12.233796960280944]
Text-VQA (Visual Question Answering) aims at question answering through reading text information in images.
LOGOS is a novel model which attempts to tackle this problem from multiple aspects.
arXiv Detail & Related papers (2021-08-20T01:31:51Z)
- TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [75.44716665758415]
We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks.
TAP explicitly incorporates scene text (generated from OCR engines) in pre-training.
Our approach outperforms the state of the art by large margins on multiple tasks.
arXiv Detail & Related papers (2020-12-08T18:55:21Z)
- Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network for TextVQA.
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it.
Our proposed model outperforms state-of-the-art models on the TextVQA dataset and two tasks of the ST-VQA dataset, among all models except the pre-training-based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z)
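The SMA entry above describes encoding object-object, object-text and text-text relationships as a graph and reasoning over it with multimodal attention. The following is a rough, hypothetical sketch of that general idea (node construction, edge typing, and one attention-weighted message-passing step); the names and the distance-based edge rule are illustrative placeholders, not the paper's implementation.

```python
# Minimal sketch of a heterogeneous scene graph over object and OCR-text nodes,
# followed by one attention-weighted message-passing step. Illustrative only.
import numpy as np

def build_edges(boxes, node_types, dist_thresh=0.3):
    """Connect nodes whose normalized box centers are close; each edge records
    which modalities it links (obj-obj, obj-text, text-text)."""
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    edges = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if np.linalg.norm(centers[i] - centers[j]) < dist_thresh:
                edges.append((i, j, f"{node_types[i]}-{node_types[j]}"))
    return edges

def attention_step(features, edges):
    """One dot-product attention pass over graph neighbors."""
    n, d = features.shape
    scores = np.full((n, n), -np.inf)
    for i, j, _ in edges:
        s = features[i] @ features[j] / np.sqrt(d)
        scores[i, j] = scores[j, i] = s
    np.fill_diagonal(scores, 0.0)  # self-loop so every node attends to itself
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ features      # attention-weighted neighbor aggregation

if __name__ == "__main__":
    # Two detected objects and one OCR token, with normalized [x1, y1, x2, y2] boxes.
    boxes = np.array([[0.10, 0.10, 0.40, 0.40],
                      [0.15, 0.20, 0.45, 0.50],
                      [0.20, 0.25, 0.30, 0.30]])
    node_types = ["obj", "obj", "text"]
    feats = np.random.default_rng(0).normal(size=(3, 8))
    edges = build_edges(boxes, node_types)
    print(len(edges), attention_step(feats, edges).shape)
```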
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.