A Transformer Based Handwriting Recognition System Jointly Using Online and Offline Features
- URL: http://arxiv.org/abs/2506.20255v1
- Date: Wed, 25 Jun 2025 08:58:47 GMT
- Title: A Transformer Based Handwriting Recognition System Jointly Using Online and Offline Features
- Authors: Ayush Lodh, Ritabrata Chakraborty, Shivakumara Palaiahnakote, Umapada Pal
- Abstract summary: We introduce an end-to-end network that performs early fusion of offline images and online stroke data.
Our approach achieves state-of-the-art accuracy, exceeding previous bests by up to 1%.
- Score: 8.419663258260671
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We posit that handwriting recognition benefits from complementary cues carried by the rasterized glyph and the pen's trajectory, yet most systems exploit only one modality. We introduce an end-to-end network that performs early fusion of offline images and online stroke data within a shared latent space. A patch encoder converts the grayscale crop into fixed-length visual tokens, while a lightweight transformer embeds the $(x, y, \text{pen})$ sequence. Learnable latent queries attend jointly to both token streams, yielding context-enhanced stroke embeddings that are pooled and decoded under a cross-entropy objective. Because integration occurs before any high-level classification, spatial and temporal cues reinforce each other during representation learning, producing stronger writer independence. Comprehensive experiments on IAMOn-DB and VNOn-DB demonstrate that our approach achieves state-of-the-art accuracy, exceeding previous bests by up to 1%. Our study also shows adaptation of this pipeline to gesture recognition ("gesturification") on the ISI-Air dataset. Our code can be found here.
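The early-fusion step the abstract describes can be sketched in a few lines: visual tokens from the patch encoder and stroke tokens from the sequence transformer are concatenated into one joint stream, learnable latent queries cross-attend over it, and the resulting embeddings are pooled and decoded under cross-entropy. The following NumPy sketch is illustrative only; it assumes single-head attention, and all dimensions, weight matrices, and names are hypothetical, not taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                   # shared latent width (assumed)
n_vis, n_str, n_q, n_cls = 49, 120, 16, 80  # token counts / classes (illustrative)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Both token streams are assumed to be already projected into the shared latent space.
visual_tokens = rng.standard_normal((n_vis, d))   # from the patch encoder
stroke_tokens = rng.standard_normal((n_str, d))   # from the (x, y, pen) transformer

# Learnable latent queries attend jointly to both streams: early fusion,
# before any high-level classification happens.
latent_queries = rng.standard_normal((n_q, d))
W_k = rng.standard_normal((d, d)) / np.sqrt(d)
W_v = rng.standard_normal((d, d)) / np.sqrt(d)

tokens = np.concatenate([visual_tokens, stroke_tokens], axis=0)   # one joint stream
attn = softmax(latent_queries @ (tokens @ W_k).T / np.sqrt(d))    # (n_q, n_vis + n_str)
fused = attn @ (tokens @ W_v)                                     # context-enhanced embeddings

# Pool the fused embeddings and decode under a cross-entropy objective.
pooled = fused.mean(axis=0)                                       # (d,)
W_out = rng.standard_normal((d, n_cls)) / np.sqrt(d)
probs = softmax(pooled @ W_out)
target = 3                                                        # dummy label
loss = -np.log(probs[target])                                     # cross-entropy for one sample
```

In a trained model the queries and projection matrices would be learned jointly with both encoders, which is what lets the two modalities shape each other's representations.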
Related papers
- AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition [0.0]
Continuously recognizing sign gestures and converting them to glosses plays a key role in bridging the gap between the hearing and hearing-impaired communities.
We propose AutoSign, an autoregressive decoder-only transformer that directly translates pose sequences to natural-language text.
By eliminating the multi-stage pipeline, AutoSign achieves substantial improvements on the Isharah-1000 dataset.
arXiv Detail & Related papers (2025-07-26T07:28:33Z)
- The Cursive Transformer [0.6138671548064355]
We introduce a novel tokenization scheme that converts pen-stroke offsets to polar coordinates, discretizes them into bins, and then turns them into sequences of tokens.
With just 3,500 handwritten words and a few simple data augmentations, we are able to train a model that can generate realistic cursive handwriting.
arXiv Detail & Related papers (2025-03-31T03:22:27Z)
- "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space.
Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z)
- A Pure Transformer Pretraining Framework on Text-attributed Graphs [50.833130854272774]
We introduce a feature-centric pretraining perspective by treating graph structure as a prior.
Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walks.
GSPT can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets.
arXiv Detail & Related papers (2024-06-19T22:30:08Z)
- UIA-ViT: Unsupervised Inconsistency-Aware Method based on Vision Transformer for Face Forgery Detection [52.91782218300844]
We propose a novel Unsupervised Inconsistency-Aware method based on Vision Transformer, called UIA-ViT.
Due to the self-attention mechanism, the attention map among patch embeddings naturally represents the consistency relation, making the vision Transformer suitable for the consistency representation learning.
arXiv Detail & Related papers (2022-10-23T15:24:47Z)
- Unsupervised Structure-Texture Separation Network for Oracle Character Recognition [70.29024469395608]
Oracle bone script is the earliest-known Chinese writing system of the Shang dynasty and is precious to archeology and philology.
We propose a structure-texture separation network (STSN), which is an end-to-end learning framework for joint disentanglement, transformation, adaptation and recognition.
arXiv Detail & Related papers (2022-05-13T10:27:02Z)
- REGTR: End-to-end Point Cloud Correspondences with Transformers [79.52112840465558]
We conjecture that attention mechanisms can replace the role of explicit feature matching and RANSAC.
We propose an end-to-end framework to directly predict the final set of correspondences.
Our approach achieves state-of-the-art performance on 3DMatch and ModelNet benchmarks.
arXiv Detail & Related papers (2022-03-28T06:01:00Z)
- SOLD2: Self-supervised Occlusion-aware Line Description and Detection [95.8719432775724]
We introduce the first joint detection and description of line segments in a single deep network.
Our method does not require any annotated line labels and can therefore generalize to any dataset.
We evaluate our approach against previous line detection and description methods on several multi-view datasets.
arXiv Detail & Related papers (2021-04-07T19:27:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.