Related papers: HierCode: A Lightweight Hierarchical Codebook for Zero-shot Chinese Text Recognition

URL: http://arxiv.org/abs/2403.13761v1
Date: Wed, 20 Mar 2024 17:20:48 GMT
Title: HierCode: A Lightweight Hierarchical Codebook for Zero-shot Chinese Text Recognition
Authors: Yuyi Zhang, Yuanzhi Zhu, Dezhi Peng, Peirong Zhang, Zhenhua Yang, Zhibo Yang, Cong Yao, Lianwen Jin
Abstract summary: We propose HierCode, a novel and lightweight codebook that exploits the innate hierarchical nature of Chinese characters. HierCode employs a multi-hot encoding strategy, leveraging hierarchical binary tree encoding and prototype learning to create distinctive, informative representations for each character. This approach not only facilitates zero-shot recognition of OOV characters by utilizing shared radicals and structures but also excels in line-level recognition tasks by computing similarity with visual features.
Score: 47.86479271322264
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text recognition, especially for complex scripts like Chinese, faces unique challenges due to its intricate character structures and vast vocabulary. Traditional one-hot encoding methods struggle with the representation of hierarchical radicals, recognition of Out-Of-Vocabulary (OOV) characters, and on-device deployment due to their computational intensity. To address these challenges, we propose HierCode, a novel and lightweight codebook that exploits the innate hierarchical nature of Chinese characters. HierCode employs a multi-hot encoding strategy, leveraging hierarchical binary tree encoding and prototype learning to create distinctive, informative representations for each character. This approach not only facilitates zero-shot recognition of OOV characters by utilizing shared radicals and structures but also excels in line-level recognition tasks by computing similarity with visual features, a notable advantage over existing methods. Extensive experiments across diverse benchmarks, including handwritten, scene, document, web, and ancient text, have showcased HierCode's superiority for both conventional and zero-shot Chinese character or text recognition, exhibiting state-of-the-art performance with significantly fewer parameters and fast inference speed.
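
To make the multi-hot idea in the abstract concrete, here is a minimal sketch and not the HierCode implementation. It assumes a hypothetical six-slot inventory (two structures, four radicals) and hand-written decompositions. It omits the hierarchical binary tree encoding and prototype learning entirely, and simply matches a feature vector to the nearest code by cosine similarity. All names (`SLOTS`, `DECOMP`, `multi_hot`, `recognize`) are illustrative.

```python
import math

# Hypothetical toy inventory (assumption): two structures and four radicals.
STRUCTURES = ["left-right", "top-bottom"]
RADICALS = ["木", "日", "月", "口"]
SLOTS = STRUCTURES + RADICALS  # one bit per structure or radical

# Hypothetical decompositions for characters "seen" during training.
DECOMP = {
    "林": ("left-right", ["木", "木"]),
    "明": ("left-right", ["日", "月"]),
    "杏": ("top-bottom", ["木", "口"]),
    "昌": ("top-bottom", ["日", "日"]),
}

def multi_hot(structure, radicals):
    """Set one bit per active structure/radical (repeated radicals collapse)."""
    active = {structure, *radicals}
    return [1.0 if slot in active else 0.0 for slot in SLOTS]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

codebook = {ch: multi_hot(*parts) for ch, parts in DECOMP.items()}

# An out-of-vocabulary character gets a code from its decomposition alone,
# because it reuses bits already shared with seen characters.
codebook["朋"] = multi_hot("left-right", ["月", "月"])

def recognize(feature):
    """Return the character whose code is most similar to the feature."""
    return max(codebook, key=lambda ch: cosine(feature, codebook[ch]))

# A noisy, made-up "visual feature" that mostly activates left-right and 月.
noisy_feature = [0.9, 0.1, 0.0, 0.2, 0.8, 0.1]
predicted = recognize(noisy_feature)
```

In this toy setup the unseen character 朋 is recoverable only because its structure and radical bits are shared with seen characters. A flat bit-per-component code like this one cannot distinguish characters that use the same components in a different arrangement.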

Related papers

Zero-Shot Chinese Character Recognition with Hierarchical Multi-Granularity Image-Text Aligning [52.92837273570818]
Chinese characters exhibit unique structures and compositional rules, allowing for the use of fine-grained semantic information in representation. We propose a Hierarchical Multi-Granularity Image-Text Aligning (Hi-GITA) framework based on a contrastive paradigm. Our proposed Hi-GITA outperforms existing zero-shot CCR methods.
arXiv Detail & Related papers (2025-05-30T17:39:14Z)
Towards Improved Text-Aligned Codebook Learning: Multi-Hierarchical Codebook-Text Alignment with Long Text [17.35793995814643]
We propose a novel Text-Augmented Codebook Learning framework, named TA-VQ. It generates longer text for each image using the visual-language model for improved text-aligned codebook learning. To tackle two challenges, we propose to split the long text into multiple granularities for encoding, i.e., word, phrase, and sentence.
arXiv Detail & Related papers (2025-03-03T07:38:18Z)
Discriminative-Generative Custom Tokens for Vision-Language Models [101.40245125955306]
This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs). Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries.
arXiv Detail & Related papers (2025-02-17T18:13:42Z)
Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs. We propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs.
arXiv Detail & Related papers (2024-11-26T18:28:09Z)
Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning [61.34060587461462]
We propose a two-stage framework for Chinese Text Recognition (CTR). We pre-train a CLIP-like model through aligning printed character images and Ideographic Description Sequences (IDS). This pre-training stage simulates humans recognizing Chinese characters and obtains the canonical representation of each character. The learned representations are employed to supervise the CTR model, such that traditional single-character recognition can be improved to text-line recognition.
arXiv Detail & Related papers (2023-09-03T05:33:16Z)
DTrOCR: Decoder-only Transformer for Optical Character Recognition [0.0]
We propose a simpler and more effective method for text recognition, known as the Decoder-only Transformer for Optical Character Recognition (DTrOCR). This method uses a decoder-only Transformer to take advantage of a generative language model that is pre-trained on a large corpus. Our experiments demonstrated that DTrOCR outperforms current state-of-the-art methods by a large margin in the recognition of printed, handwritten, and scene text in both English and Chinese.
arXiv Detail & Related papers (2023-08-30T12:37:03Z)
Learning Generative Structure Prior for Blind Text Image Super-resolution [153.05759524358467]
We present a novel prior that focuses more on the character structure. To restrict the generative space of StyleGAN, we store the discrete features for each character in a codebook. The proposed structure prior exerts stronger character-specific guidance to restore faithful and precise strokes of a designated character.
arXiv Detail & Related papers (2023-03-26T13:54:28Z)
Enhancing Indic Handwritten Text Recognition Using Global Semantic Information [36.01828106385858]
We use a semantic module in an encoder-decoder framework for extracting global semantic information to recognize the Indic handwritten texts. The proposed framework achieves state-of-the-art results on handwritten texts of ten Indic languages.
arXiv Detail & Related papers (2022-12-15T12:53:26Z)
MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework. We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images. We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z)
Dual Encoding for Video Retrieval by Text [49.34356217787656]
We propose a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Our novelty is two-fold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding. Second, different from a conventional common space learning algorithm which is either concept based or latent space based, we introduce hybrid space learning.
arXiv Detail & Related papers (2020-09-10T15:49:39Z)
Neural Computing for Online Arabic Handwriting Character Recognition using Hard Stroke Features Mining [0.0]
An enhanced method is proposed for detecting the desired critical points from the vertical and horizontal direction-length of handwriting stroke features in online Arabic script recognition. A minimum feature set is extracted from these tokens for classification of characters using a multilayer perceptron with a back-propagation learning algorithm and a modified sigmoid-based activation function. The proposed method achieves an average accuracy of 98.6%, comparable to state-of-the-art character recognition techniques.
arXiv Detail & Related papers (2020-05-02T23:17:08Z)
Separating Content from Style Using Adversarial Learning for Recognizing Text in the Wild [103.51604161298512]
We propose an adversarial learning framework for the generation and recognition of multiple characters in an image. Our framework can be integrated into recent recognition methods to achieve new state-of-the-art recognition accuracy.
arXiv Detail & Related papers (2020-01-13T12:41:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.