Language Matters: A Weakly Supervised Pre-training Approach for Scene
Text Detection and Spotting
- URL: http://arxiv.org/abs/2203.03911v1
- Date: Tue, 8 Mar 2022 08:10:45 GMT
- Title: Language Matters: A Weakly Supervised Pre-training Approach for Scene
Text Detection and Spotting
- Authors: Chuhui Xue, Yu Hao, Shijian Lu, Philip Torr, Song Bai
- Abstract summary: This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks, respectively.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Recently, Vision-Language Pre-training (VLP) techniques have greatly
benefited various vision-language tasks by jointly learning visual and textual
representations, which intuitively helps in Optical Character Recognition (OCR)
tasks due to the rich visual and textual information in scene text images.
However, these methods cannot cope well with OCR tasks because of the
difficulty in both instance-level text encoding and image-text pair acquisition
(i.e., images and the texts captured in them). This paper presents a weakly
supervised pre-training method that can acquire effective scene text
representations by jointly learning and aligning visual and textual
information. Our network consists of an image encoder and a character-aware
text encoder that extract visual and textual features, respectively, as well as
a visual-textual decoder that models the interaction between textual and visual
features for learning effective scene text representations. With the learning
of textual features, the pre-trained model can attend to texts in images with
character awareness. Besides, these designs enable learning from weakly
annotated texts (i.e., partial texts in images without text bounding boxes),
which greatly mitigates the data annotation constraint. Experiments on the
weakly annotated images in ICDAR2019-LSVT show that our pre-trained model
improves the F-score by +2.5% and +4.8% when its weights are transferred to other
text detection and spotting networks, respectively. In addition, the proposed
method outperforms existing pre-training techniques consistently across
multiple public datasets (e.g., +3.2% and +1.3% for Total-Text and CTW1500).
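
To make the design above concrete, here is a minimal PyTorch sketch of the three components the abstract names (an image encoder, a character-aware text encoder, and a visual-textual decoder), together with a contrastive alignment loss that needs only image-level transcriptions, matching the weakly supervised setting. The module sizes, pooling choices, and InfoNCE-style objective are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Tiny CNN backbone producing a grid of visual tokens."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, images):                       # (B, 3, H, W)
        f = self.net(images)                         # (B, dim, H/4, W/4)
        return f.flatten(2).transpose(1, 2)          # (B, HW/16, dim) visual tokens

class CharAwareTextEncoder(nn.Module):
    """Encodes a text instance character by character, then pools."""
    def __init__(self, vocab_size=100, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, char_ids):                     # (B, L) character indices
        h, _ = self.rnn(self.embed(char_ids))        # (B, L, dim)
        return h.mean(dim=1)                         # (B, dim) one vector per text

class VisualTextualDecoder(nn.Module):
    """Lets the text feature attend over visual tokens (cross-attention)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_feat, vis_tokens):
        out, _ = self.attn(text_feat.unsqueeze(1), vis_tokens, vis_tokens)
        return out.squeeze(1)                        # (B, dim) fused feature

def contrastive_alignment(vis, txt, temperature=0.07):
    """InfoNCE-style loss aligning image- and text-level features.
    Needs only (image, transcription) pairs -- no bounding boxes."""
    vis, txt = F.normalize(vis, dim=-1), F.normalize(txt, dim=-1)
    logits = vis @ txt.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(vis.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy forward pass with random data.
imgs = torch.randn(4, 3, 64, 256)
chars = torch.randint(0, 100, (4, 12))               # 12 characters per image-level text
vis_tokens = ImageEncoder()(imgs)                    # (4, 1024, 256)
txt_feat = CharAwareTextEncoder()(chars)             # (4, 256)
fused = VisualTextualDecoder()(txt_feat, vis_tokens) # (4, 256)
loss = contrastive_alignment(vis_tokens.mean(dim=1), fused)
```

Because the loss pairs whole images with their (possibly partial) transcriptions, no text bounding boxes are required, which is what makes the annotation requirement weak.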
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
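As a rough sketch of the DPTR idea summarized above, the code below feeds text-derived embeddings to an attention decoder as if they were visual features, so the decoder can be pre-trained from text alone. The CLIP text tower is replaced by a random embedding table as a stand-in, and all sizes and the vocabulary are assumptions.

```python
import torch
import torch.nn as nn

class RecognitionDecoder(nn.Module):
    """Attention decoder mapping a feature sequence to per-position character logits."""
    def __init__(self, dim=512, vocab_size=97, max_len=25):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(max_len, dim))
        self.attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, feats):                        # feats: (B, N, dim)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, feats, feats)          # one query per character slot
        return self.head(out)                        # (B, max_len, vocab_size)

# Stand-in for CLIP's text tower (an assumption, NOT the real CLIP API).
clip_text_encoder = nn.Embedding(49408, 512)

tokens = torch.randint(0, 49408, (8, 77))            # CLIP-style tokenized text
pseudo_visual = clip_text_encoder(tokens)            # (8, 77, 512) pseudo visual sequence
logits = RecognitionDecoder()(pseudo_visual)         # pre-train with CE against the text itself
print(logits.shape)                                  # torch.Size([8, 25, 97])
```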
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting [8.397246652127793]
We propose a new pre-training method called OCR-Text Destylization Modeling (ODM).
ODM transfers diverse styles of text found in images to a uniform style based on the text prompt.
Our method significantly improves performance and outperforms current pre-training methods in scene text detection and spotting tasks.
arXiv Detail & Related papers (2024-03-01T06:13:53Z)
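As a toy rendering of the destylization objective described for ODM above: conditioned on a text prompt, the model redraws the image's text in one uniform style (here a single-channel glyph map) and is trained with a pixel-level loss. The architecture, shapes, and target map are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Destylizer(nn.Module):
    """Predicts a uniform-style glyph map from an image and a text prompt."""
    def __init__(self, dim=64, vocab=100):
        super().__init__()
        self.img_enc = nn.Conv2d(3, dim, 3, padding=1)
        self.txt_enc = nn.Embedding(vocab, dim)
        self.dec = nn.Conv2d(dim, 1, 3, padding=1)

    def forward(self, img, prompt_ids):
        f = F.relu(self.img_enc(img))                # (B, dim, H, W) image features
        t = self.txt_enc(prompt_ids).mean(1)         # (B, dim) prompt summary
        f = f + t[:, :, None, None]                  # condition every pixel on the prompt
        return torch.sigmoid(self.dec(f))            # (B, 1, H, W) uniform-style map

img = torch.randn(2, 3, 32, 128)
prompt = torch.randint(0, 100, (2, 10))
target = torch.rand(2, 1, 32, 128)                   # placeholder uniform-style rendering
pred = Destylizer()(img, prompt)
loss = F.binary_cross_entropy(pred, target)          # pixel-level destylization loss
```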
- Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs [96.54224331778195]
We present a text-grounding document understanding model, termed TGDoc, which enhances MLLMs with the ability to discern the spatial positioning of text within images.
We formulate instruction tuning tasks including text detection, recognition, and spotting to facilitate the cohesive alignment between the visual encoder and large language model.
Our method achieves state-of-the-art performance across multiple text-rich benchmarks, validating its effectiveness.
arXiv Detail & Related papers (2023-11-22T06:46:37Z)
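The instruction-tuning formulation described for TGDoc above might look like the following hypothetical records, in which detection, recognition, and spotting are all posed as instructions over the same image. The field names and the <box> coordinate tag are assumptions for illustration, not the paper's actual schema.

```python
# Hypothetical instruction-tuning records (all values illustrative).
samples = [
    {"image": "doc_001.png",
     "instruction": "Detect all text lines and output their bounding boxes.",
     "response": "<box>12,40,180,64</box><box>12,80,200,104</box>"},
    {"image": "doc_001.png",
     "instruction": "Read the text inside <box>12,40,180,64</box>.",
     "response": "TOTAL AMOUNT"},
    {"image": "doc_001.png",
     "instruction": "Spot all text: output each box followed by its transcript.",
     "response": "<box>12,40,180,64</box> TOTAL AMOUNT"},
]
```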
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- PreSTU: Pre-Training for Scene-Text Understanding [49.288302725486226]
We propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU).
PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and connect it to the rest of the image content.
We empirically demonstrate the effectiveness of this pre-training approach on eight visual question answering and four image captioning benchmarks.
arXiv Detail & Related papers (2022-09-12T18:29:55Z)
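An OCR-aware objective of the kind PreSTU describes above can be sketched as image-conditioned text generation: a decoder is trained to emit the scene text associated with each image, connecting recognition to the image content. The tiny backbone, GRU decoder, and OCR token source below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OCRAwarePretrainer(nn.Module):
    """Generates OCR tokens conditioned on a pooled image feature."""
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, dim, 8, stride=8), nn.Flatten(2))
        self.embed = nn.Embedding(vocab, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, img, ocr_tokens):
        ctx = self.backbone(img).mean(-1)            # (B, dim) pooled image summary
        x = self.embed(ocr_tokens)                   # teacher-forced OCR tokens
        h, _ = self.decoder(x, ctx.unsqueeze(0))     # image summary as initial state
        return self.head(h)                          # next-token logits

model = OCRAwarePretrainer()
img = torch.randn(2, 3, 224, 224)
ocr = torch.randint(0, 1000, (2, 16))                # tokens from an off-the-shelf OCR engine
logits = model(img, ocr[:, :-1])                     # predict each next token
loss = F.cross_entropy(logits.reshape(-1, 1000), ocr[:, 1:].reshape(-1))
```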
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Primitive Representation Learning for Scene Text Recognition [7.818765015637802]
We propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images.
A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding.
We also propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods.
arXiv Detail & Related papers (2021-05-10T11:54:49Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
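The graph-based reasoning described above can be illustrated with a single graph-convolution step over a joint set of salient-object and scene-text nodes in a shared feature space. The adjacency matrix, feature size, and single-layer depth are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution step with symmetric normalization."""
    def __init__(self, dim=300):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        a = adj + torch.eye(adj.size(0))             # add self-loops: A + I
        d = a.sum(-1).rsqrt().diag()                 # D^{-1/2}
        return torch.relu(self.proj(d @ a @ d @ x))  # D^{-1/2}(A+I)D^{-1/2} X W

obj_feats = torch.randn(5, 300)                      # 5 detected salient objects
txt_feats = torch.randn(3, 300)                      # 3 recognized text instances
nodes = torch.cat([obj_feats, txt_feats], dim=0)     # one graph, two modalities
adj = torch.ones(8, 8)                               # fully connected for the sketch
enhanced = GCNLayer()(nodes, adj)                    # relationship-enhanced features
print(enhanced.shape)                                # torch.Size([8, 300])
```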