TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
- URL: http://arxiv.org/abs/2012.04638v1
- Date: Tue, 8 Dec 2020 18:55:21 GMT
- Title: TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
- Authors: Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio,
Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo
- Abstract summary: We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks.
TAP explicitly incorporates scene text (generated from OCR engines) in pre-training.
Our approach outperforms the state of the art by large margins on multiple tasks.
- Score: 75.44716665758415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and
Text-Caption tasks. These two tasks aim at reading and understanding scene text
in images for question answering and image caption generation, respectively. In
contrast to the conventional vision-language pre-training that fails to capture
scene text and its relationship with the visual and text modalities, TAP
explicitly incorporates scene text (generated from OCR engines) in
pre-training. With three pre-training tasks, including masked language modeling
(MLM), image-text (contrastive) matching (ITM), and relative (spatial) position
prediction (RPP), TAP effectively helps the model learn a better aligned
representation among the three modalities: text word, visual object, and scene
text. Due to this aligned representation learning, even pre-trained on the same
downstream task dataset, TAP already boosts the absolute accuracy on the
TextVQA dataset by +5.4%, compared with a non-TAP baseline. To further improve
the performance, we build a large-scale dataset based on the Conceptual Caption
dataset, named OCR-CC, which contains 1.4 million scene text-related image-text
pairs. Pre-trained on this OCR-CC dataset, our approach outperforms the state
of the art by large margins on multiple tasks, i.e., +8.3% accuracy on TextVQA,
+8.6% accuracy on ST-VQA, and +10.2 CIDEr score on TextCaps.
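The three pre-training objectives can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch-style example of how MLM, ITM, and RPP losses over fused text-word, visual-object, and scene-text representations might be combined into a single pre-training loss. The module and head names (and the number of spatial-relation classes) are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TAPPretrainingHeads(nn.Module):
    """Toy heads for the three TAP objectives: MLM, ITM, and RPP (sketch only)."""

    def __init__(self, hidden_size: int, vocab_size: int, num_relations: int = 12):
        super().__init__()
        # num_relations is an assumption; the paper defines its own set of spatial relations.
        self.mlm_head = nn.Linear(hidden_size, vocab_size)          # masked token prediction
        self.itm_head = nn.Linear(hidden_size, 2)                   # matched vs. mismatched text
        self.rpp_head = nn.Linear(2 * hidden_size, num_relations)   # object-OCR spatial relation

    def forward(self, token_states, cls_state, obj_states, ocr_states,
                mlm_labels, itm_labels, rpp_pairs, rpp_labels):
        # token_states: (B, L, H)  fused states of question/caption words and OCR tokens
        # cls_state:    (B, H)     pooled multimodal state for the whole input
        # obj_states:   (B, No, H) visual-object region states
        # ocr_states:   (B, Nt, H) scene-text region states
        # rpp_pairs:    (B, P, 2)  indices of (object, OCR) region pairs to classify
        mlm_loss = F.cross_entropy(
            self.mlm_head(token_states).transpose(1, 2), mlm_labels, ignore_index=-100)
        itm_loss = F.cross_entropy(self.itm_head(cls_state), itm_labels)

        batch_idx = torch.arange(obj_states.size(0)).unsqueeze(1)
        obj = obj_states[batch_idx, rpp_pairs[..., 0]]              # (B, P, H)
        ocr = ocr_states[batch_idx, rpp_pairs[..., 1]]              # (B, P, H)
        rpp_logits = self.rpp_head(torch.cat([obj, ocr], dim=-1))   # (B, P, num_relations)
        rpp_loss = F.cross_entropy(rpp_logits.transpose(1, 2), rpp_labels)

        return mlm_loss + itm_loss + rpp_loss
```

In the full model these losses would sit on top of a multimodal transformer that fuses the question or caption words, detected object features, and OCR token features, which is the aligned three-modality representation the abstract describes.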
Related papers
- Bridging Text and Vision: A Multi-View Text-Vision Registration Approach for Cross-Modal Place Recognition [4.562684361220731]
We propose a text-vision registration approach called Text4VPR for the place recognition task.
Text4VPR exclusively utilizes textual descriptions to match a database of images.
On Street360Loc, the first text-to-image VPR dataset, which we created, Text4VPR establishes a robust baseline, achieving a leading top-1 accuracy of 57% and a top-10 accuracy of 92% within a 5-meter radius on the test set.
arXiv Detail & Related papers (2025-02-20T02:00:02Z)
- SceneVTG++: Controllable Multilingual Visual Text Generation in the Wild [55.619708995575785]
The text in natural scene images needs to meet four key criteria.
The generated text can facilitate the training of natural scene OCR (Optical Character Recognition) models.
The generated images have superior utility in OCR tasks such as text detection and text recognition.
arXiv Detail & Related papers (2025-01-06T12:09:08Z)
- InstructOCR: Instruction Boosting Scene Text Spotting [10.724187109801251]
InstructOCR is an innovative instruction-based scene text spotting model.
Our framework employs both text and image encoders during training and inference.
We achieve state-of-the-art results on widely used benchmarks.
arXiv Detail & Related papers (2024-12-20T03:23:26Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with a Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language Representation Learning with Pre-trained Sequence-to-Sequence Model [18.848107244522666]
TextVQA requires models to read and reason about text in images to answer questions about them.
In this challenge, we use the generative model T5 for the TextVQA task.
arXiv Detail & Related papers (2021-06-24T06:39:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.