HandReader: Advanced Techniques for Efficient Fingerspelling Recognition
- URL: http://arxiv.org/abs/2505.10267v1
- Date: Thu, 15 May 2025 13:18:37 GMT
- Title: HandReader: Advanced Techniques for Efficient Fingerspelling Recognition
- Authors: Pavel Korotaev, Petr Surovtsev, Alexander Kapitanov, Karina Kvanchiani, Aleksandr Nagaev
- Abstract summary: This paper introduces HandReader, a group of three architectures designed to address the fingerspelling recognition task. HandReader$_{RGB}$ employs the novel Temporal Shift-Adaptive Module (TSAM) to process RGB features from videos of varying lengths. HandReader$_{KP}$ is built on the proposed Temporal Pose Encoder (TPE), which operates on keypoints represented as tensors. Each HandReader model possesses distinct advantages and achieves state-of-the-art results on the ChicagoFSWild and ChicagoFSWild+ datasets.
- Score: 75.38606213726906
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Fingerspelling, characterized by fast hand movements during signing, is a significant component of Sign Language (SL) that enables the interpretation of proper names. Although previous works on fingerspelling recognition have focused on processing the temporal dimension of videos, there remains room for improving the accuracy of these approaches. This paper introduces HandReader, a group of three architectures designed to address the fingerspelling recognition task. HandReader$_{RGB}$ employs the novel Temporal Shift-Adaptive Module (TSAM) to process RGB features from videos of varying lengths while preserving important sequential information. HandReader$_{KP}$ is built on the proposed Temporal Pose Encoder (TPE), which operates on keypoints represented as tensors. Composing keypoints this way within a batch allows the encoder to pass them through 2D and 3D convolution layers, exploiting temporal and spatial information while accumulating keypoint coordinates. We also introduce HandReader$_{RGB+KP}$, an architecture with a joint encoder that benefits from both RGB and keypoint modalities. Each HandReader model possesses distinct advantages and achieves state-of-the-art results on the ChicagoFSWild and ChicagoFSWild+ datasets. Moreover, the models demonstrate high performance on Znaki, the first open dataset for Russian fingerspelling, introduced in this paper. The Znaki dataset and pre-trained HandReader models are publicly available.
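The abstract gives no implementation details, but the two core ideas it mentions, shifting features along the temporal axis of RGB sequences and passing keypoint tensors through 2D and 3D convolution layers, can be illustrated with a short PyTorch sketch. Everything below (the `temporal_shift` helper, the `KeypointEncoder` module, and all layer sizes) is a hypothetical illustration of the general techniques, not the authors' TSAM or TPE implementation.

```python
# Illustrative sketch only (assumed PyTorch layout; not the authors' TSAM/TPE code).
import torch
import torch.nn as nn


def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """Shift a fraction of feature channels one step forward/backward in time.

    x: (batch, time, channels) per-frame RGB backbone features. Shifting channels
    across neighbouring frames exchanges temporal information at negligible cost,
    which is the general idea behind temporal-shift style modules.
    """
    b, t, c = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                    # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # leave the rest untouched
    return out


class KeypointEncoder(nn.Module):
    """Hypothetical TPE-style encoder over keypoint tensors of shape (B, T, K, D)."""

    def __init__(self, coord_dim: int = 3, hidden: int = 64):
        super().__init__()
        # 2D conv: coordinates act as channels, (time, keypoint) is the spatial grid.
        self.conv2d = nn.Conv2d(coord_dim, hidden, kernel_size=3, padding=1)
        # 3D conv: mixes the learned features, time, and keypoints jointly.
        self.conv3d = nn.Conv3d(1, hidden, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)

    def forward(self, kp: torch.Tensor) -> torch.Tensor:
        x = kp.permute(0, 3, 1, 2)                   # (B, D, T, K)
        x = torch.relu(self.conv2d(x))               # (B, hidden, T, K)
        x = torch.relu(self.conv3d(x.unsqueeze(1)))  # (B, hidden, hidden, T, K)
        return self.pool(x).flatten(1)               # (B, hidden) clip-level embedding


# Example: 2 clips, 30 frames, 42 hand keypoints with (x, y, z) coordinates.
frames = torch.randn(2, 30, 512)                     # per-frame RGB features
keypoints = torch.randn(2, 30, 42, 3)
shifted = temporal_shift(frames)                     # same shape as `frames`
embedding = KeypointEncoder()(keypoints)             # (2, 64)
```

The variable-length handling emphasized in the abstract would additionally require masking or packing padded frames before the shift; that detail is omitted here for brevity.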
Related papers
- HAND: Hierarchical Attention Network for Multi-Scale Handwritten Document Recognition and Layout Analysis [21.25786478579275]
Handwritten document recognition is one of the most challenging tasks in computer vision. Traditionally, this problem has been approached as two separate tasks, handwritten text recognition and layout analysis. This paper introduces HAND, a novel end-to-end and segmentation-free architecture for simultaneous text recognition and layout analysis tasks.
arXiv Detail & Related papers (2024-12-25T20:36:29Z)
- Reconstructing Hands in 3D with Transformers [64.15390309553892]
We present an approach that can reconstruct hands in 3D from monocular input.
Our approach for Hand Mesh Recovery, HaMeR, follows a fully transformer-based architecture and can analyze hands with significantly increased accuracy and robustness compared to previous work.
arXiv Detail & Related papers (2023-12-08T18:59:07Z)
- Towards Few-shot Entity Recognition in Document Images: A Graph Neural Network Approach Robust to Image Manipulation [38.09501948846373]
We introduce the topological adjacency relationship among the tokens, emphasizing their relative position information.
We incorporate these graphs into the pre-trained language model by adding graph neural network layers on top of the language model embeddings.
Experiments on two benchmark datasets show that LAGER significantly outperforms strong baselines under different few-shot settings.
arXiv Detail & Related papers (2023-05-24T07:34:33Z)
- Boosting Modern and Historical Handwritten Text Recognition with Deformable Convolutions [52.250269529057014]
Handwritten Text Recognition (HTR) in free-layout pages is a challenging image understanding task.
We propose to adopt deformable convolutions, which can deform depending on the input at hand and better adapt to the geometric variations of the text.
arXiv Detail & Related papers (2022-08-17T06:55:54Z)
- Keypoint Message Passing for Video-based Person Re-Identification [106.41022426556776]
Video-based person re-identification (re-ID) is an important technique in visual surveillance systems which aims to match video snippets of people captured by different cameras.
Existing methods are mostly based on convolutional neural networks (CNNs), whose building blocks either process local neighbor pixels at a time, or, when 3D convolutions are used to model temporal information, suffer from the misalignment problem caused by person movement.
In this paper, we propose to overcome the limitations of normal convolutions with a human-oriented graph method. Specifically, features located at person joint keypoints are extracted and connected as a spatial-temporal graph.
arXiv Detail & Related papers (2021-11-16T08:01:16Z)
- On-device Real-time Hand Gesture Recognition [1.4658400971135652]
We present an on-device real-time hand gesture recognition (HGR) system, which detects a set of predefined static gestures from a single RGB camera.
We use MediaPipe Hands as the basis of the hand skeleton tracker, improve the keypoint accuracy, and add the estimation of 3D keypoints in a world metric space.
arXiv Detail & Related papers (2021-10-29T18:33:25Z)
- Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z)
- RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video [76.86512780916827]
We present the first real-time method for motion capture of skeletal pose and 3D surface geometry of hands from a single RGB camera.
In order to address the inherent depth ambiguities in RGB data, we propose a novel multi-task CNN.
We experimentally verify the individual components of our RGB two-hand tracking and 3D reconstruction pipeline.
arXiv Detail & Related papers (2021-06-22T12:53:56Z)
- FastHand: Fast Hand Pose Estimation From A Monocular Camera [12.790733588554588]
We propose a fast and accurate framework for hand pose estimation, dubbed "FastHand".
FastHand offers high accuracy scores while reaching a speed of 25 frames per second on an NVIDIA Jetson TX2 graphics processing unit.
arXiv Detail & Related papers (2021-02-14T04:12:41Z)
- Two-hand Global 3D Pose Estimation Using Monocular RGB [0.0]
We tackle the challenging task of estimating global 3D joint locations for both hands via only monocular RGB input images.
We propose a novel multi-stage convolutional neural network based pipeline that accurately segments and locates the hands.
We present the first work that achieves accurate global 3D hand tracking on both hands using RGB-only inputs.
arXiv Detail & Related papers (2020-06-01T23:53:52Z)