Orientation-Independent Chinese Text Recognition in Scene Images
- URL: http://arxiv.org/abs/2309.01081v1
- Date: Sun, 3 Sep 2023 05:30:21 GMT
- Title: Orientation-Independent Chinese Text Recognition in Scene Images
- Authors: Haiyang Yu, Xiaocong Wang, Bin Li, Xiangyang Xue
- Abstract summary: We make the first attempt to extract orientation-independent visual features by disentangling the content and orientation information of text images.
Specifically, we introduce a Character Image Reconstruction Network (CIRN) to recover the corresponding printed character images with disentangled content and orientation information.
- Score: 61.34060587461462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text recognition (STR) has attracted much attention due to its broad applications. Previous works focus on recognizing Latin text images with complex backgrounds by introducing language models or other auxiliary networks. Unlike Latin texts, many Chinese texts in natural scenes are vertical, which poses difficulties for current state-of-the-art STR methods. In this paper, we make the first attempt to extract orientation-independent visual features by disentangling the content and orientation information of text images, thus recognizing both horizontal and vertical texts robustly in natural scenes. Specifically, we introduce a Character Image Reconstruction Network (CIRN) to recover the corresponding printed character images with disentangled content and orientation information. We conduct experiments on a scene dataset for benchmarking Chinese text recognition, and the results demonstrate that the proposed method can indeed improve performance by disentangling content and orientation information. To further validate the effectiveness of our method, we additionally collect a Vertical Chinese Text Recognition (VCTR) dataset. The experimental results show that the proposed method achieves a 45.63% improvement on VCTR when CIRN is introduced into the baseline model.
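To make the disentanglement idea concrete, here is a minimal sketch of how a content/orientation split with a CIRN-style reconstruction branch could look. This is an illustration inferred from the abstract alone, not the authors' implementation: the module names (`content_head`, `cirn_decoder`), sizes, and choice of losses are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledRecognizer(nn.Module):
    """Toy content/orientation disentanglement with a CIRN-style branch.
    Module sizes and losses are illustrative guesses, not the paper's."""

    def __init__(self, num_classes, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(          # toy CNN feature extractor
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Split the shared feature into content and orientation parts.
        self.content_head = nn.Linear(feat_dim, feat_dim)
        self.orient_head = nn.Linear(feat_dim, feat_dim)
        self.char_clf = nn.Linear(feat_dim, num_classes)   # which character
        self.orient_clf = nn.Linear(feat_dim, 2)           # horizontal / vertical
        # CIRN-style decoder: reconstruct a canonical printed glyph (32x32)
        # from the content feature alone, pushing orientation cues out of it.
        self.cirn_decoder = nn.Sequential(nn.Linear(feat_dim, 32 * 32), nn.Sigmoid())

    def forward(self, char_img, printed_img, char_label, orient_label):
        feat = self.backbone(char_img)
        content = self.content_head(feat)
        orient = self.orient_head(feat)
        loss_rec = F.mse_loss(self.cirn_decoder(content),
                              printed_img.flatten(1))      # (B, 1024) glyph target
        loss_cls = F.cross_entropy(self.char_clf(content), char_label)
        loss_ori = F.cross_entropy(self.orient_clf(orient), orient_label)
        return loss_cls + loss_ori + loss_rec
```

The design intuition: because the reconstruction target is a canonically rendered printed glyph regardless of how the input was oriented, gradients from the reconstruction loss push orientation cues out of the content feature.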
Related papers
- Text Image Generation for Low-Resource Languages with Dual Translation Learning [0.0]
This study proposes a novel approach that generates text images in low-resource languages by emulating the style of real text images from high-resource languages.
Training involves dual translation tasks, in which the model transforms plain text images into either synthetic or real text images (a rough sketch follows this entry).
To enhance the accuracy and variety of generated text images, we introduce two guidance techniques.
arXiv Detail & Related papers (2024-09-26T11:23:59Z)
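A rough sketch of the dual translation idea: a single generator renders a plain text image into either a "synthetic" or a "real" style, selected by a domain flag. Everything here (the domain embedding, the toy encoder/decoder) is a hypothetical simplification; the paper's model and its two guidance techniques are not reproduced.

```python
import torch
import torch.nn as nn

class DualTranslationGenerator(nn.Module):
    """Toy style-conditioned renderer: one network, two translation targets."""

    def __init__(self, ch=64):
        super().__init__()
        self.domain_embed = nn.Embedding(2, ch)   # 0 = synthetic, 1 = real
        self.enc = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh())

    def forward(self, plain_img, domain):
        h = self.enc(plain_img)
        style = self.domain_embed(domain)[:, :, None, None]  # broadcast over H, W
        return self.dec(h + style)                           # style-conditioned render
```

Training would pair each rendered output with a reconstruction or adversarial loss against synthetic or real targets, which is the part the dual-task formulation supervises.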
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with Only Text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder (a sketch follows this entry).
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
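A rough illustration of the DPTR idea, assuming OpenAI's `clip` package and a generic transformer decoder; the decoder sizes, charset, and `pretrain_step` helper are hypothetical, and token handling is simplified.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.eval()  # frozen; only its text encoder is used

# Generic autoregressive decoder to be pre-trained (hypothetical sizes).
layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=4).to(device)
char_embed = nn.Embedding(100, 512).to(device)  # toy charset of 100 symbols
out_proj = nn.Linear(512, 100).to(device)

def pretrain_step(texts, target_ids):
    """One DPTR-style step: CLIP text embeddings act as pseudo visual memory.
    Teacher forcing; input/target shifting is omitted for brevity."""
    with torch.no_grad():
        memory = clip_model.encode_text(clip.tokenize(texts).to(device)).float()
    memory = memory.unsqueeze(1)               # (B, 1, 512) stand-in "image" features
    tgt = char_embed(target_ids)               # (B, T, 512)
    mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(device)
    hidden = decoder(tgt, memory, tgt_mask=mask)
    logits = out_proj(hidden)                  # (B, T, charset)
    return nn.functional.cross_entropy(logits.reshape(-1, 100), target_ids.reshape(-1))
```

Because CLIP's text and image embeddings live in a shared space, text embeddings are a cheap stand-in for the visual features the decoder will attend to at fine-tuning time.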
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models (a toy sketch follows this entry).
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
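One simple way to realize "corpus priors instead of one-hot targets" is to train against smoothed next-character distributions estimated from a corpus. The bigram statistics and smoothing below are an assumed toy stand-in for the paper's actual text distributions, not its method.

```python
from collections import Counter, defaultdict
import torch
import torch.nn.functional as F

def build_bigram_prior(corpus_lines, charset, alpha=0.1):
    """Soft next-character distributions from corpus bigram counts
    (additive smoothing), indexed by the previous character."""
    idx = {c: i for i, c in enumerate(charset)}
    counts = defaultdict(Counter)
    for line in corpus_lines:
        for prev, nxt in zip(line, line[1:]):
            if prev in idx and nxt in idx:
                counts[prev][nxt] += 1
    prior = torch.full((len(charset), len(charset)), alpha)
    for prev, ctr in counts.items():
        for nxt, n in ctr.items():
            prior[idx[prev], idx[nxt]] += n
    return prior / prior.sum(dim=1, keepdim=True)  # rows sum to 1

def soft_target_loss(logits, prev_char_ids, prior):
    """KL divergence to the prior distribution instead of one-hot cross-entropy."""
    target = prior[prev_char_ids]                  # (B, charset)
    return F.kl_div(F.log_softmax(logits, dim=-1), target, reduction="batchmean")
```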
- Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning [61.34060587461462]
We propose a two-stage framework for Chinese Text Recognition (CTR).
We pre-train a CLIP-like model by aligning printed character images with Ideographic Description Sequences (IDS) (a contrastive-loss sketch follows this entry).
This pre-training stage simulates how humans recognize Chinese characters and obtains a canonical representation of each character.
The learned representations are employed to supervise the CTR model, such that traditional single-character recognition can be improved to text-line recognition.
arXiv Detail & Related papers (2023-09-03T05:33:16Z)
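The first pre-training stage is described as CLIP-like alignment between printed character images and IDS; a symmetric InfoNCE loss is the standard way to implement such alignment. The encoders that produce `img_feats` and `ids_feats` are left abstract here, and the temperature is an assumed default.

```python
import torch
import torch.nn.functional as F

def clip_style_alignment_loss(img_feats, ids_feats, temperature=0.07):
    """Symmetric InfoNCE between printed-character image features and
    IDS-sequence features; matched pairs share the same batch index."""
    img = F.normalize(img_feats, dim=-1)
    ids = F.normalize(ids_feats, dim=-1)
    logits = img @ ids.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```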
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features (the text encoder is sketched after this entry).
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
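A "character-aware" text encoder can be sketched as a transformer over per-character embeddings rather than word tokens, so the text feature reflects spelling. Sizes, the charset, and the mean pooling (which naively includes padding) are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CharAwareTextEncoder(nn.Module):
    """Toy character-aware text encoder: embeds individual characters and
    lets a small transformer mix them. Illustrative sizes only."""

    def __init__(self, charset_size=97, dim=256):
        super().__init__()
        self.embed = nn.Embedding(charset_size, dim, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, char_ids):                 # (B, T) character indices
        h = self.encoder(self.embed(char_ids))   # (B, T, dim)
        return h.mean(dim=1)                     # pooled text feature (B, dim)
```

Paired with an image encoder, such a text encoder would typically be trained with a contrastive objective like the one sketched earlier in this list.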
- Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study [25.609450020149637]
Existing text recognition methods mainly target English texts, while ignoring the pivotal role of Chinese texts.
We manually collect Chinese text datasets from publicly available competitions, projects, and papers, then divide them into four categories including scene, web, document, and handwriting datasets.
By analyzing the experimental results, we surprisingly observe that state-of-the-art baselines for recognizing English texts do not perform well in Chinese scenarios.
arXiv Detail & Related papers (2021-12-30T15:30:52Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment (a toy sketch follows this entry).
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
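Text-to-pixel alignment can be illustrated as a per-pixel similarity between a sentence embedding and dense visual features, trained against the referring mask. This toy version assumes binary masks and omits CRIS's vision-language decoder entirely.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_loss(pixel_feats, text_feat, mask, temperature=0.07):
    """pixel_feats: (B, C, H, W); text_feat: (B, C); mask: (B, H, W) in {0, 1}.
    Pixels inside the referred region should score high against the text."""
    pix = F.normalize(pixel_feats, dim=1)
    txt = F.normalize(text_feat, dim=1)
    score = torch.einsum("bchw,bc->bhw", pix, txt) / temperature
    return F.binary_cross_entropy_with_logits(score, mask.float())
```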