TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
- URL: http://arxiv.org/abs/2012.04638v1
- Date: Tue, 8 Dec 2020 18:55:21 GMT
- Title: TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
- Authors: Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio,
Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo
- Abstract summary: We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks.
TAP explicitly incorporates scene text (generated from OCR engines) in pre-training.
Our approach outperforms the state of the art by large margins on multiple tasks.
- Score: 75.44716665758415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and
Text-Caption tasks. These two tasks aim at reading and understanding scene text
in images for question answering and image caption generation, respectively. In
contrast to the conventional vision-language pre-training that fails to capture
scene text and its relationship with the visual and text modalities, TAP
explicitly incorporates scene text (generated from OCR engines) in
pre-training. With three pre-training tasks, including masked language modeling
(MLM), image-text (contrastive) matching (ITM), and relative (spatial) position
prediction (RPP), TAP effectively helps the model learn a better aligned
representation among the three modalities: text word, visual object, and scene
text. Due to this aligned representation learning, even pre-trained on the same
downstream task dataset, TAP already boosts the absolute accuracy on the
TextVQA dataset by +5.4%, compared with a non-TAP baseline. To further improve
the performance, we build a large-scale dataset based on the Conceptual Caption
dataset, named OCR-CC, which contains 1.4 million scene text-related image-text
pairs. Pre-trained on this OCR-CC dataset, our approach outperforms the state
of the art by large margins on multiple tasks, i.e., +8.3% accuracy on TextVQA,
+8.6% accuracy on ST-VQA, and +10.2 CIDEr score on TextCaps.
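To make the three pre-training objectives concrete, below is a minimal PyTorch-style sketch of how MLM, ITM, and RPP heads could be combined on top of a multimodal encoder that fuses text words, visual objects, and OCR scene-text tokens. All module names, shapes, and the number of spatial-relation classes are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumptions, not the TAP authors' code) of summing the three
# pre-training losses: masked language modeling (MLM), image-text matching (ITM),
# and relative (spatial) position prediction (RPP) over encoder outputs.
import torch
import torch.nn as nn

class TAPStylePretrainingHeads(nn.Module):
    def __init__(self, hidden_size=768, vocab_size=30522, num_spatial_relations=12):
        super().__init__()
        self.mlm_head = nn.Linear(hidden_size, vocab_size)            # predict masked word / OCR tokens
        self.itm_head = nn.Linear(hidden_size, 2)                     # matched vs. mismatched image-text pair
        self.rpp_head = nn.Linear(hidden_size, num_spatial_relations) # spatial relation of an (object, OCR) pair
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)              # -100 marks unmasked / padded positions

    def forward(self, token_states, cls_state, pair_states,
                mlm_labels, itm_labels, rpp_labels):
        # token_states: (B, L, H) fused sequence; cls_state: (B, H); pair_states: (B, P, H)
        loss_mlm = self.ce(self.mlm_head(token_states).transpose(1, 2), mlm_labels)
        loss_itm = self.ce(self.itm_head(cls_state), itm_labels)
        loss_rpp = self.ce(self.rpp_head(pair_states).transpose(1, 2), rpp_labels)
        return loss_mlm + loss_itm + loss_rpp

# Toy usage with random tensors standing in for encoder outputs.
B, L, P, H = 2, 20, 6, 768
heads = TAPStylePretrainingHeads(hidden_size=H)
loss = heads(torch.randn(B, L, H), torch.randn(B, H), torch.randn(B, P, H),
             torch.randint(0, 30522, (B, L)), torch.randint(0, 2, (B,)),
             torch.randint(0, 12, (B, P)))
loss.backward()
```

The sketch only shows how the three losses could be summed during pre-training; in the paper, pre-training is followed by fine-tuning on the downstream Text-VQA or Text-Caption task.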
Related papers
- ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting [8.397246652127793]
We propose a new pre-training method called OCR-Text Destylization Modeling (ODM).
ODM transfers diverse styles of text found in images to a uniform style based on the text prompt.
Our method significantly improves performance and outperforms current pre-training methods in scene text detection and spotting tasks.
arXiv Detail & Related papers (2024-03-01T06:13:53Z)
- Separate and Locate: Rethink the Text in Text-based Visual Question Answering [15.84929733099542]
We propose Separate and Locate (SaL), which explores text contextual cues and designs spatial position embeddings to construct spatial relations between OCR texts.
Our SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on the TextVQA and ST-VQA datasets, respectively.
arXiv Detail & Related papers (2023-08-31T01:00:59Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language Representation Learning with Pre-trained Sequence-to-Sequence Model [18.848107244522666]
TextVQA requires models to read and reason about text in images to answer questions about them.
In this challenge, we use the generative model T5 for the TextVQA task.
arXiv Detail & Related papers (2021-06-24T06:39:37Z)