PreSTU: Pre-Training for Scene-Text Understanding
- URL: http://arxiv.org/abs/2209.05534v3
- Date: Sun, 20 Aug 2023 00:25:44 GMT
- Title: PreSTU: Pre-Training for Scene-Text Understanding
- Authors: Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian
Goodman, Wei-Lun Chao, and Radu Soricut
- Abstract summary: We propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU)
PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and connect it to the rest of the image content.
We empirically demonstrate the effectiveness of this pre-training approach on eight visual question answering and four image captioning benchmarks.
- Score: 49.288302725486226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to recognize and reason about text embedded in visual inputs is
often lacking in vision-and-language (V&L) models, perhaps because V&L
pre-training methods have often failed to include such an ability in their
training objective. In this paper, we propose PreSTU, a novel pre-training
recipe dedicated to scene-text understanding (STU). PreSTU introduces OCR-aware
pre-training objectives that encourage the model to recognize text from an
image and connect it to the rest of the image content. We implement PreSTU
using a simple transformer-based encoder-decoder architecture, combined with
large-scale image-text datasets with scene text obtained from an off-the-shelf
OCR system. We empirically demonstrate the effectiveness of this pre-training
approach on eight visual question answering and four image captioning
benchmarks.
Related papers
- VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding [18.609441902943445]
VisFocus is an OCR-free method designed to better exploit the vision encoder's capacity by coupling it directly with the language prompt.
We pair the architecture enhancements with a novel pre-training task, using language masking on a snippet of the document text fed to the visual encoder.
Our experiments demonstrate that this prompt-guided visual encoding approach significantly improves performance.
arXiv Detail & Related papers (2024-07-17T14:16:46Z) - Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of underlineScene underlineKnowledge-guided underlineVisual underlineGrounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
arXiv Detail & Related papers (2023-06-15T12:29:42Z) - MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z) - Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z) - Language Matters: A Weakly Supervised Pre-training Approach for Scene
Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% while transferring its weights to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z) - E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual
Learning [31.622393984150314]
We propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation.
We build a unified Transformer framework to jointly learn visual representation, and semantic alignments between image and text.
arXiv Detail & Related papers (2021-06-03T12:50:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.