SCOB: Universal Text Understanding via Character-wise Supervised
Contrastive Learning with Online Text Rendering for Bridging Domain Gap
- URL: http://arxiv.org/abs/2309.12382v1
- Date: Thu, 21 Sep 2023 15:06:08 GMT
- Title: SCOB: Universal Text Understanding via Character-wise Supervised
Contrastive Learning with Online Text Rendering for Bridging Domain Gap
- Authors: Daehee Kim, Yoonsik Kim, DongHyun Kim, Yumin Lim, Geewook Kim, Taeho
Kil
- Abstract summary: We propose a novel pre-training method called SCOB that leverages character-wise supervised contrastive learning with online text rendering.
SCOB enables weakly supervised learning, significantly reducing annotation costs.
Our findings suggest that SCOB can be applied generally and effectively to read-type pre-training methods.
- Score: 10.011953474950744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by the great success of language model (LM)-based pre-training,
recent studies in visual document understanding have explored LM-based
pre-training methods for modeling text within document images. Among them,
pre-training that reads all text from an image has shown promise, but often
exhibits instability and even fails when applied to broader domains, such as
those involving both visual documents and scene text images. This is a
substantial limitation for real-world scenarios, where the processing of text
image inputs in diverse domains is essential. In this paper, we investigate
effective pre-training tasks in the broader domains and also propose a novel
pre-training method called SCOB that leverages character-wise supervised
contrastive learning with online text rendering to effectively pre-train
document and scene text domains by bridging the domain gap. Moreover, SCOB
enables weakly supervised learning, significantly reducing annotation costs.
Extensive benchmarks demonstrate that SCOB generally improves vanilla
pre-training methods and achieves comparable performance to state-of-the-art
methods. Our findings suggest that SCOB can be applied generally and effectively
to read-type pre-training methods. The code will be available at
https://github.com/naver-ai/scob.
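The abstract describes the method only at a high level; as a rough illustration, below is a minimal sketch of what character-wise supervised contrastive learning over a mix of online-rendered and real character embeddings could look like. The helper names (render_text_online, char_supcon_loss), shapes, and defaults are assumptions for illustration, not code from the authors' repository; per-character embeddings and character-class labels are assumed to be available from the model and its annotations (or from the rendered source string itself).

```python
# Minimal sketch (not the authors' implementation) of the two ingredients the
# abstract names: online text rendering and character-wise supervised
# contrastive learning. Helper names, shapes, and the font path are assumptions.
import torch
import torch.nn.functional as F
from PIL import Image, ImageDraw, ImageFont


def render_text_online(text: str, font_path: str = "DejaVuSans.ttf", size: int = 32) -> Image.Image:
    """Render a string to an RGB image on the fly (hypothetical helper)."""
    font = ImageFont.truetype(font_path, size)          # any available TrueType font works
    left, top, right, bottom = font.getbbox(text)
    img = Image.new("RGB", (right - left + 8, bottom - top + 8), "white")
    ImageDraw.Draw(img).text((4 - left, 4 - top), text, fill="black", font=font)
    return img


def char_supcon_loss(char_embs: torch.Tensor, char_labels: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss over character embeddings.

    char_embs:   (N, D) embeddings of individual characters, mixing rendered and
                 real-image characters so the loss can bridge the domain gap.
    char_labels: (N,) integer character classes; same character = positive pair.
    """
    z = F.normalize(char_embs, dim=1)
    sim = z @ z.t() / temperature                        # (N, N) scaled cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (char_labels.unsqueeze(0) == char_labels.unsqueeze(1)) & ~self_mask

    # Log-softmax over all non-self pairs (the anchor is excluded from the denominator).
    logits = sim.masked_fill(self_mask, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Average the positives' log-probability for each anchor that has at least one positive.
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_counts
    return per_anchor[pos_mask.any(dim=1)].mean()
```

In a setup like this, character labels for the rendered half of a batch come for free from the source string, which is one way the weak supervision mentioned in the abstract could be realized; the authors' actual formulation is in the paper and the repository linked above.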
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- LEGO: Self-Supervised Representation Learning for Scene Text Images [32.21085469233465]
We propose a Local Explicit and Global Order-aware self-supervised representation learning method for scene text images.
Inspired by the human cognitive process of learning words, we propose three novel pre-text tasks for LEGO to model sequential, semantic, and structural features.
The LEGO recognizer achieves performance superior or comparable to state-of-the-art scene text recognition methods on six benchmarks.
arXiv Detail & Related papers (2024-08-04T14:07:14Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% when transferring its weights to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)