LEGO: Self-Supervised Representation Learning for Scene Text Images
- URL: http://arxiv.org/abs/2408.02036v1
- Date: Sun, 4 Aug 2024 14:07:14 GMT
- Title: LEGO: Self-Supervised Representation Learning for Scene Text Images
- Authors: Yujin Ren, Jiaxin Zhang, Lianwen Jin
- Abstract summary: We propose a Local Explicit and Global Order-aware self-supervised representation learning method for scene text images.
Inspired by the human cognitive process of learning words, we propose three novel pre-text tasks for LEGO to model sequential, semantic, and structural features.
The LEGO recognizer achieves performance superior or comparable to state-of-the-art scene text recognition methods on six benchmarks.
- Score: 32.21085469233465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, significant progress has been made in scene text recognition by data-driven methods. However, due to the scarcity of annotated real-world data, the training of these methods predominantly relies on synthetic data. The distribution gap between synthetic and real data constrains further performance improvement of these methods in real-world applications. To tackle this problem, a highly promising approach is to utilize massive amounts of unlabeled real data for self-supervised training, which has been widely proven effective in many NLP and CV tasks. Nevertheless, generic self-supervised methods are unsuitable for scene text images due to their sequential nature. To address this issue, we propose a Local Explicit and Global Order-aware self-supervised representation learning method (LEGO) that accounts for the characteristics of scene text images. Inspired by the human cognitive process of learning words, which involves spelling, reading, and writing, we propose three novel pre-text tasks for LEGO to model sequential, semantic, and structural features, respectively. The entire pre-training process is optimized by using a consistent Text Knowledge Codebook. Extensive experiments validate that LEGO outperforms previous scene text self-supervised methods. A recognizer built on our pre-trained model achieves performance superior or comparable to state-of-the-art scene text recognition methods on six benchmarks. Furthermore, we demonstrate that LEGO can achieve superior performance in other text-related tasks.
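The abstract names three pre-text tasks tied together by a Text Knowledge Codebook but gives no implementation details. The sketch below shows one plausible shape for the shared-codebook part (a VQ-VAE-style nearest-code lookup with a straight-through gradient) and for summing three task losses; all module and loss names here are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class TextKnowledgeCodebook(nn.Module):
    """Minimal VQ-style codebook sketch: snap each frame feature to its
    nearest learned code, with a straight-through estimator for gradients."""
    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, feats):                        # feats: (B, T, D)
        # squared distance from every frame feature to every code: (B, T, K)
        dists = (feats.unsqueeze(2) - self.codes.weight).pow(2).sum(-1)
        idx = dists.argmin(dim=-1)                   # nearest code per frame
        codes = self.codes(idx)
        # standard VQ-VAE auxiliary terms so both encoder and codebook learn
        vq_loss = (codes - feats.detach()).pow(2).mean() \
                + 0.25 * (feats - codes.detach()).pow(2).mean()
        quant = feats + (codes - feats).detach()     # straight-through trick
        return quant, idx, vq_loss

# Toy usage: three placeholder pre-text losses share the quantized features.
B, T, D = 4, 16, 256
feats = torch.randn(B, T, D, requires_grad=True)     # stand-in encoder output
codebook = TextKnowledgeCodebook(dim=D)
quant, idx, vq_loss = codebook(feats)
# The real pre-text tasks model sequential, semantic, and structural
# features; these dummy scalars only show how the objectives combine.
loss_spelling = quant.var(dim=1).mean()
loss_reading = quant.pow(2).mean()
loss_writing = quant.abs().mean()
total = loss_spelling + loss_reading + loss_writing + vq_loss
total.backward()
```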
Related papers
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions serve as contextual information and, together with the image input, are used to train a transformer-based architecture.
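A minimal sketch of such a two-stage pipeline, assuming a hypothetical `vllm_describe` interface for stage one and pre-computed text/image embeddings for stage two; none of these names come from the paper.

```python
import torch
import torch.nn as nn

def vllm_describe(image) -> str:
    """Stage 1 (assumed interface): prompt a vision-and-large-language
    model for a natural-language description of the apparent emotion."""
    return "The person appears tense, with furrowed brows."  # placeholder

class FusionClassifier(nn.Module):
    """Stage 2 sketch: a transformer over concatenated description and
    image tokens, followed by an emotion classification head."""
    def __init__(self, dim=512, num_classes=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, text_emb, img_emb):    # (B, Lt, D), (B, Li, D)
        tokens = torch.cat([text_emb, img_emb], dim=1)
        return self.head(self.encoder(tokens).mean(dim=1))
```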
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
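One simple way to realize "corpus-derived distributions instead of one-hot targets" is a soft cross-entropy that mixes the ground-truth label with a language-model character distribution. This is an illustrative reading of the idea, with all shapes and the mixing weight `alpha` assumed.

```python
import torch
import torch.nn.functional as F

def soft_target_loss(logits, char_ids, lm_probs, alpha=0.1):
    """Cross-entropy against a soft target: a blend of the ground-truth
    one-hot and a corpus-derived character distribution (assumed given).
    logits: (B, T, V); char_ids: (B, T); lm_probs: (B, T, V)."""
    one_hot = F.one_hot(char_ids, logits.size(-1)).float()
    target = (1 - alpha) * one_hot + alpha * lm_probs
    log_p = F.log_softmax(logits, dim=-1)
    return -(target * log_p).sum(-1).mean()
```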
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- SCOB: Universal Text Understanding via Character-wise Supervised Contrastive Learning with Online Text Rendering for Bridging Domain Gap [10.011953474950744]
We propose a novel pre-training method called SCOB that leverages character-wise supervised contrastive learning with online text rendering.
SCOB enables weakly supervised learning, significantly reducing annotation costs.
Our findings suggest that SCOB can serve as a general and effective pre-training approach for read-type methods.
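A compact sketch of character-wise supervised contrastive learning: per-character embeddings (from rendered or real text images) attract when they depict the same character and repel otherwise. This follows the standard supervised-contrastive formulation; SCOB's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def char_supcon_loss(emb, labels, tau=0.07):
    """Supervised contrastive loss over character embeddings.
    emb: (N, D) per-character features; labels: (N,) character classes."""
    z = F.normalize(emb, dim=-1)
    sim = z @ z.t() / tau                                   # (N, N)
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))         # drop self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)         # avoid -inf * 0
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    has_pos = pos.any(dim=1)                                # anchors with a positive
    per_anchor = -(log_prob * pos).sum(1)[has_pos] / pos.sum(1)[has_pos]
    return per_anchor.mean()
```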
arXiv Detail & Related papers (2023-09-21T15:06:08Z)
- Text-Only Training for Visual Storytelling [107.19873669536523]
We formulate visual storytelling as a visual-conditioned story generation problem.
We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
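One common realization of decoupling alignment from generation (in the spirit of CLIP-based text-only captioning, not necessarily the authors' exact design): train a decoder conditioned on text embeddings from a pre-aligned multimodal space, then swap in image embeddings from the same space at inference.

```python
import torch
import torch.nn as nn

class StoryDecoder(nn.Module):
    """Sketch: a decoder conditioned on a shared multimodal embedding.
    Trained with text embeddings only; at test time an image embedding
    from the same aligned space (e.g., CLIP's) is substituted."""
    def __init__(self, vocab=30000, dim=512):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, cond, tokens):     # cond: (B, D), tokens: (B, L)
        h0 = cond.unsqueeze(0)           # condition as the initial state
        x, _ = self.rnn(self.tok(tokens), h0)
        return self.out(x)               # next-token logits: (B, L, vocab)
```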
arXiv Detail & Related papers (2023-08-17T09:32:17Z)
- CSSL-MHTR: Continual Self-Supervised Learning for Scalable Multi-script Handwritten Text Recognition [16.987008461171065]
We explore the potential of continual self-supervised learning to alleviate the catastrophic forgetting problem in handwritten text recognition.
Our method consists of adding intermediate layers, called adapters, for each task, and efficiently distilling knowledge from the previous model while learning the current task.
We attain state-of-the-art performance on English, Italian and Russian scripts, whilst adding only a few parameters per task.
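A sketch of the two ingredients named here, both standard techniques: a bottleneck adapter added per task, and a temperature-scaled distillation loss against the previous task's model. Dimensions and the temperature are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Bottleneck adapter: a small residual module added per task so the
    shared backbone stays fixed while only a few parameters are learned."""
    def __init__(self, dim=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(F.relu(self.down(x)))

def distill_loss(student_logits, teacher_logits, tau=2.0):
    """Soften both distributions and match them (standard KD), so the
    current-task model retains the previous-task model's behaviour."""
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction='batchmean') * tau * tau
```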
arXiv Detail & Related papers (2023-03-16T14:27:45Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
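The paper's objective builds on game-theoretic (Shapley) interactions, which are too involved for a short sketch; as a cheap stand-in, the snippet below scores fine-grained token-to-patch alignment FILIP-style (max over patches, mean over tokens), illustrating what "fine-grained" means relative to global image-text similarity.

```python
import torch
import torch.nn.functional as F

def fine_grained_alignment(txt, img):
    """Simplified token-to-patch alignment score, not LOUPE's actual
    game-theoretic objective.
    txt: (B, Lt, D) token features; img: (B, Lp, D) patch features."""
    txt = F.normalize(txt, dim=-1)
    img = F.normalize(img, dim=-1)
    sim = torch.einsum('btd,bpd->btp', txt, img)  # token-patch similarities
    # best-matching patch per token, averaged over tokens: (B,)
    return sim.max(dim=-1).values.mean(dim=-1)
```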
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition [101.60244147302197]
We introduce contrastive learning and masked image modeling to learn discrimination and generation of text images.
Our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets.
Our proposed text recognizer exceeds previous state-of-the-art text recognition methods by an average of 5.3% on 11 benchmarks, with a similar model size.
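A hedged sketch of the "reading and writing" recipe as summarized above: an InfoNCE term over two views of a text image (discrimination) plus an L1 reconstruction term on masked patches (generation). Shapes, the temperature, and the weighting `lam` are assumptions.

```python
import torch
import torch.nn.functional as F

def discriminative_generative_loss(z1, z2, pred_pix, true_pix, mask, lam=1.0):
    """Contrastive + masked-image-modeling objective.
    z1, z2: (B, D) embeddings of two views; pred_pix, true_pix: (B, N, P)
    per-patch pixel values; mask: (B, N) with 1 on masked patches."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / 0.07                   # (B, B) similarity matrix
    labels = torch.arange(len(z1), device=z1.device)
    contrastive = F.cross_entropy(logits, labels)  # match views of same image
    rec = (pred_pix - true_pix).abs().mean(-1)     # L1 error per patch
    mim = (rec * mask).sum() / mask.sum().clamp(min=1)  # masked patches only
    return contrastive + lam * mim
```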
arXiv Detail & Related papers (2022-07-01T03:50:26Z)
- Multimodal Semi-Supervised Learning for Text Recognition [10.33262222726707]
We present semi-supervised learning for multimodal text recognizers (SemiMTR) that leverages unlabeled data at each modality training phase.
Our algorithm starts by pretraining the vision model through a single-stage training that unifies self-supervised learning with supervised training.
In a novel setup, consistency is enforced on each modality separately.
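A sketch of per-modality consistency on unlabeled data, using a FixMatch-style confident-pseudo-label formulation as a stand-in (SemiMTR's exact consistency objective may differ); it would be applied once to the vision modality and once to the language modality.

```python
import torch
import torch.nn.functional as F

def consistency_loss(weak_logits, strong_logits, threshold=0.9):
    """The weak view's confident prediction supervises the strong view.
    weak_logits, strong_logits: (B, V) predictions for two augmentations."""
    with torch.no_grad():
        probs = F.softmax(weak_logits, dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf >= threshold           # only trust confident pseudo-labels
    loss = F.cross_entropy(strong_logits, pseudo, reduction='none')
    return (loss * keep).sum() / keep.sum().clamp(min=1)
```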
arXiv Detail & Related papers (2022-05-08T13:55:30Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
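A minimal sketch of the described architecture: a character-aware text encoder that embeds spelling rather than word IDs, aligned to image features with a standard contrastive loss. Layer choices and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharAwareTextEncoder(nn.Module):
    """Embed each character of a word and pool with a GRU, so the text
    branch sees spelling rather than opaque word identifiers."""
    def __init__(self, num_chars=128, dim=256):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, char_ids):           # (B, L) character indices
        _, h = self.rnn(self.char_emb(char_ids))
        return F.normalize(h[-1], dim=-1)  # (B, D) word embedding

def image_text_contrastive(img_emb, txt_emb, tau=0.07):
    """Align visual features with the character-aware text features."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau
    labels = torch.arange(len(logits), device=logits.device)
    return F.cross_entropy(logits, labels)
```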
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.