Entropy-Aware Structural Alignment for Zero-Shot Handwritten Chinese Character Recognition
- URL: http://arxiv.org/abs/2602.03913v2
- Date: Tue, 10 Feb 2026 16:46:50 GMT
- Title: Entropy-Aware Structural Alignment for Zero-Shot Handwritten Chinese Character Recognition
- Authors: Qiuming Luo, Tao Zeng, Feng Li, Heming Liu, Rui Mao, Chang Kong
- Abstract summary: Zero-shot Handwritten Chinese Character Recognition aims to recognize unseen characters by leveraging radical-based semantic compositions. We propose an Entropy-Aware Structural Alignment Network that bridges the visual-semantic gap through information-theoretic modeling. Our method establishes new state-of-the-art performance, achieving an accuracy of 55.04% on the ICDAR 2013 dataset.
- Score: 7.632962062462334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zero-shot Handwritten Chinese Character Recognition (HCCR) aims to recognize unseen characters by leveraging radical-based semantic compositions. However, existing approaches often treat characters as flat radical sequences, neglecting the hierarchical topology and the uneven information density of different components. To address these limitations, we propose an Entropy-Aware Structural Alignment Network that bridges the visual-semantic gap through information-theoretic modeling. First, we introduce an Information Entropy Prior to dynamically modulate positional embeddings via multiplicative interaction, acting as a saliency detector that prioritizes discriminative roots over ubiquitous components. Second, we construct a Dual-View Radical Tree to extract multi-granularity structural features, which are integrated via an adaptive Sigmoid-based gating network to encode both global layout and local spatial roles. Finally, a Top-K Semantic Feature Fusion mechanism is devised to augment the decoding process by utilizing the centroid of semantic neighbors, effectively rectifying visual ambiguities through feature-level consensus. Extensive experiments demonstrate that our method establishes new state-of-the-art performance, achieving an accuracy of 55.04% on the ICDAR 2013 dataset ($m=1500$), significantly outperforming existing CLIP-based baselines in the challenging zero-shot setting. Furthermore, the framework exhibits exceptional data efficiency, demonstrating rapid adaptability with minimal support samples, achieving 92.41% accuracy with only one support sample per class.
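The entropy-weighted positional modulation and Top-K semantic fusion described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the frequency-based information prior, and the cosine-similarity neighbor search are all assumptions made for clarity.

```python
import numpy as np

def entropy_prior(radical_freqs):
    """Information content -log p(r) per radical: rare (discriminative)
    radicals receive a higher saliency weight than ubiquitous components."""
    p = radical_freqs / radical_freqs.sum()
    return -np.log(p)

def modulate_positions(pos_emb, info):
    """Multiplicative interaction: scale each radical position's embedding
    by its normalized information weight."""
    w = info / info.max()
    return pos_emb * w[:, None]

def topk_semantic_fusion(query, neighbors, k=3):
    """Centroid of the k nearest semantic neighbors (cosine similarity),
    used to rectify a visually ambiguous query feature by consensus."""
    sims = neighbors @ query / (
        np.linalg.norm(neighbors, axis=1) * np.linalg.norm(query) + 1e-8)
    idx = np.argsort(-sims)[:k]
    return neighbors[idx].mean(axis=0)
```

For example, a ubiquitous radical appearing 100 times more often than a rare one is down-weighted toward zero, while the rare radical keeps its full positional embedding.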
Related papers
- HAAF: Hierarchical Adaptation and Alignment of Foundation Models for Few-Shot Pathology Anomaly Detection [10.649984141835189]
We propose the Hierarchical Adaptation and Alignment Framework (HAAF). At its core is a novel Cross-Level Scaled Alignment mechanism that enforces a sequential calibration order. A dual-branch inference strategy integrates semantic scores with geometric prototypes to ensure stability in few-shot settings.
arXiv Detail & Related papers (2026-01-24T10:31:21Z) - ResAgent: Entropy-based Prior Point Discovery and Visual Reasoning for Referring Expression Segmentation [21.87321809019825]
Referring Expression Segmentation (RES) is a core vision-language segmentation task that enables pixel-level understanding of targets via free-form linguistic expressions. ResAgent is a novel RES framework integrating Entropy-Based Point Discovery (EBD) and Vision-Based Reasoning (VBR).
arXiv Detail & Related papers (2026-01-23T01:56:04Z) - Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion [31.189038928192648]
Co2S is a semi-supervised RS segmentation framework that fuses priors from vision-language models and self-supervised models. An explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries. Experiments on six popular datasets demonstrate the superiority of the proposed method.
arXiv Detail & Related papers (2025-12-28T18:24:19Z) - Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition [51.03674130115878]
We introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel "compression-aggregation-compression" architecture. KINN establishes a state-of-the-art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios.
arXiv Detail & Related papers (2025-10-23T07:12:26Z) - Generalizing WiFi Gesture Recognition via Large-Model-Aware Semantic Distillation and Alignment [6.124050993047708]
WiFi-based gesture recognition has emerged as a promising RF sensing paradigm for AIoT environments. We propose a novel generalization framework, termed Large-Model-Aware Semantic Distillation and Alignment. Our method offers a scalable and deployable solution for generalized RF-based gesture interfaces in real-world AIoT applications.
arXiv Detail & Related papers (2025-10-15T10:28:50Z) - Beyond BEV: Optimizing Point-Level Tokens for Collaborative Perception [17.654858416126093]
Collaborative perception allows agents to enhance their perceptual capabilities by exchanging intermediate features. Existing methods typically organize these intermediate features as 2D bird's-eye-view (BEV) representations. We present CoPLOT, a novel collaborative perception framework that utilizes Point-Level Optimized Tokens.
arXiv Detail & Related papers (2025-08-27T07:27:42Z) - "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - Scale-wise Bidirectional Alignment Network for Referring Remote Sensing Image Segmentation [12.893224628061516]
The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within an aerial image via a natural language expression. We propose an innovative framework called Scale-wise Bidirectional Alignment Network (SBANet) to address these challenges. Our proposed method achieves superior performance in comparison to previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets.
arXiv Detail & Related papers (2025-01-01T14:24:04Z) - Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). To further foster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z) - Mesh Denoising Transformer [104.5404564075393]
Mesh denoising is aimed at removing noise from input meshes while preserving their feature structures.
SurfaceFormer is a pioneering Transformer-based mesh denoising framework.
New representation known as Local Surface Descriptor captures local geometric intricacies.
Denoising Transformer module receives the multimodal information and achieves efficient global feature aggregation.
arXiv Detail & Related papers (2024-05-10T15:27:43Z) - UGMAE: A Unified Framework for Graph Masked Autoencoders [67.75493040186859]
We propose UGMAE, a unified framework for graph masked autoencoders.
We first develop an adaptive feature mask generator to account for the unique significance of nodes.
We then design a ranking-based structure reconstruction objective joint with feature reconstruction to capture holistic graph information.
arXiv Detail & Related papers (2024-02-12T19:39:26Z) - Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z) - Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework that jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experiment results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by Transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z) - HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning [74.76431541169342]
Zero-shot learning (ZSL) tackles the unseen class recognition problem, transferring semantic knowledge from seen classes to unseen ones.
We propose a novel hierarchical semantic-visual adaptation (HSVA) framework to align semantic and visual domains.
Experiments on four benchmark datasets demonstrate HSVA achieves superior performance on both conventional and generalized ZSL.
arXiv Detail & Related papers (2021-09-30T14:27:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.