Improving Handwritten Text Recognition via 3D Attention and Multi-Scale Training
- URL: http://arxiv.org/abs/2410.18374v3
- Date: Sat, 02 Aug 2025 03:07:00 GMT
- Title: Improving Handwritten Text Recognition via 3D Attention and Multi-Scale Training
- Authors: Zi-Rui Wang
- Abstract summary: We propose a new recognition network by using a novel three-dimensional (3D) attention module and global-local context information. Main canonical neural units including attention mechanisms, fully-connected layers, recurrent units and convolutional layers are efficiently organized into a network.
- Score: 19.05500901000957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The segmentation-free research efforts for addressing handwritten text recognition can be divided into three categories: connectionist temporal classification (CTC), hidden Markov model and encoder-decoder methods. In this paper, inspired by the above three modeling methods, we propose a new recognition network by using a novel three-dimensional (3D) attention module and global-local context information. Based on the feature maps of the last convolutional layer, a series of 3D blocks with different resolutions are split. Then, these 3D blocks are fed into the 3D attention module to generate sequential visual features. Finally, by fusing the visual features and the corresponding global-local context features, a well-designed representation can be obtained. Main canonical neural units including attention mechanisms, fully-connected layers, recurrent units and convolutional layers are efficiently organized into a network and can be jointly trained by the CTC loss and the cross-entropy loss. Experiments on the latest Chinese handwritten text datasets (the SCUT-HCCDoc and the SCUT-EPT) and one English handwritten text dataset (the IAM) show that the proposed method can achieve comparable results with the state-of-the-art methods. The code is available at https://github.com/Wukong90/3DAttention-MultiScaleTraining-for-HTR.
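To make the abstract's pipeline concrete, here is a minimal NumPy sketch of pooling sequential visual features from 3D blocks of the last convolutional feature map. The width-wise split into equal blocks, the single scoring vector `w`, and the softmax pooling are illustrative assumptions for this sketch, not the paper's actual 3D attention module (see the linked repository for the real implementation).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(block, w):
    """Collapse one 3D block (C, H, Wb) into a single C-dim visual feature.

    Each spatial position gets a scalar score from the (assumed learned)
    vector w; softmax over scores gives attention weights, and the output
    is the weighted sum of position-wise feature columns.
    """
    C, H, Wb = block.shape
    flat = block.reshape(C, H * Wb)   # columns = spatial positions
    alpha = softmax(w @ flat)         # one attention weight per position
    return flat @ alpha               # (C,) attended feature

def feature_map_to_sequence(fmap, num_blocks, w):
    """Split a (C, H, W) feature map width-wise and pool each block,
    yielding a (num_blocks, C) sequence of visual features for the
    downstream recurrent/CTC stage."""
    blocks = np.split(fmap, num_blocks, axis=2)  # requires W % num_blocks == 0
    return np.stack([attention_pool(b, w) for b in blocks])
```

In the full model, the resulting sequence would be fused with global-local context features and decoded under a joint CTC and cross-entropy objective; that part is omitted here.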
Related papers
- Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge [45.19482892758984]
Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. We introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity to yield semantically organized representations. We further design the Cross-modal Affordance Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps.
arXiv Detail & Related papers (2025-10-09T15:01:26Z) - ASIA: Adaptive 3D Segmentation using Few Image Annotations [21.733451438390755]
ASIA (Adaptive 3D Segmentation using few Image Annotations) is a novel framework that enables segmentation of possibly non-semantic and non-text-describable "parts" in 3D. Our method leverages the rich priors of text-to-image diffusion models, such as Stable Diffusion (SD), to transfer segmentations from image space to 3D. ASIA provides a practical and generalizable solution for both semantic and non-semantic 3D segmentation tasks.
arXiv Detail & Related papers (2025-09-29T05:04:11Z) - Hierarchical Neural Semantic Representation for 3D Semantic Correspondence [72.8101601086805]
First, we design the hierarchical neural semantic representation (HNSR), which consists of a global semantic feature to capture high-level structure and multi-resolution local geometric features. Second, we design a progressive global-to-local matching strategy, which establishes coarse semantic correspondence using the global semantic feature. Third, our framework is training-free and broadly compatible with various pre-trained 3D generative backbones, demonstrating strong generalization across diverse shape categories.
arXiv Detail & Related papers (2025-09-22T07:23:07Z) - PGOV3D: Open-Vocabulary 3D Semantic Segmentation with Partial-to-Global Curriculum [20.206273757144547]
PGOV3D is a novel framework that introduces a Partial-to-Global curriculum for improving open-vocabulary 3D semantic segmentation. We pre-train the model on partial scenes that provide dense semantic information but relatively simple geometry. In the second stage, we fine-tune the model on complete scene-level point clouds, which are sparser and structurally more complex.
arXiv Detail & Related papers (2025-06-30T08:13:07Z) - 3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance [68.8825501902835]
3DSS-VLG is a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance.
To the best of our knowledge, this is the first work to investigate 3D weakly supervised semantic segmentation by using the textual semantic information of text category labels.
arXiv Detail & Related papers (2024-07-13T09:39:11Z) - Segment Any 3D Object with Language [58.471327490684295]
We introduce Segment any 3D Object with LanguagE (SOLE), a semantic and geometric-aware visual-language learning framework with strong generalizability.
Specifically, we propose a multimodal fusion network to incorporate multimodal semantics in both backbone and decoder.
Our SOLE outperforms previous methods by a large margin on ScanNetv2, ScanNet200, and Replica benchmarks.
arXiv Detail & Related papers (2024-04-02T17:59:10Z) - Representing 3D sparse map points and lines for camera relocalization [1.2974519529978974]
We show how a lightweight neural network can learn to represent both 3D point and line features.
In tests, our method secures a significant lead, marking the most considerable enhancement over state-of-the-art learning-based methodologies.
arXiv Detail & Related papers (2024-02-28T03:07:05Z) - VolumeDiffusion: Flexible Text-to-3D Generation with Efficient Volumetric Encoder [56.59814904526965]
This paper introduces a pioneering 3D encoder designed for text-to-3D generation.
A lightweight network is developed to efficiently acquire feature volumes from multi-view images.
The 3D volumes are then trained on a diffusion model for text-to-3D generation using a 3D U-Net.
arXiv Detail & Related papers (2023-12-18T18:59:05Z) - TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes [67.5351491691866]
We present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles.
Our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes.
arXiv Detail & Related papers (2023-12-07T12:10:05Z) - Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z) - Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z) - SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and Quasi-Planar Segmentation [53.83313235792596]
We present a new methodology for real-time semantic mapping from RGB-D sequences.
It combines a 2D neural network and a 3D network based on a SLAM system with 3D occupancy mapping.
Our system achieves state-of-the-art semantic mapping quality within 2D-3D networks-based systems.
arXiv Detail & Related papers (2023-06-28T22:36:44Z) - Cross-Dimensional Refined Learning for Real-Time 3D Visual Perception from Monocular Video [2.2299983745857896]
We present a novel real-time capable learning method that jointly perceives a 3D scene's geometry structure and semantic labels.
We propose an end-to-end cross-dimensional refinement neural network (CDRNet) to extract both 3D mesh and 3D semantic labeling in real time.
arXiv Detail & Related papers (2023-03-16T11:53:29Z) - ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with a small number of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z) - Continuous Offline Handwriting Recognition using Deep Learning Models [0.0]
Handwritten text recognition is an open problem of great interest in the area of automatic document image analysis.
We have proposed a new recognition model based on integrating two types of deep learning architectures: convolutional neural networks (CNN) and sequence-to-sequence (seq2seq) models.
The new proposed model provides competitive results with those obtained with other well-established methodologies.
arXiv Detail & Related papers (2021-12-26T07:31:03Z) - TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG).
arXiv Detail & Related papers (2021-11-16T09:10:39Z) - Improving 3D Object Detection with Channel-wise Transformer [58.668922561622466]
We propose a two-stage 3D object detection framework (CT3D) with minimal hand-crafted design.
CT3D simultaneously performs proposal-aware embedding and channel-wise context aggregation.
It achieves the AP of 81.77% in the moderate car category on the KITTI test 3D detection benchmark.
arXiv Detail & Related papers (2021-08-23T02:03:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.