Siformer: Feature-isolated Transformer for Efficient Skeleton-based Sign Language Recognition
- URL: http://arxiv.org/abs/2503.20436v1
- Date: Wed, 26 Mar 2025 11:10:29 GMT
- Title: Siformer: Feature-isolated Transformer for Efficient Skeleton-based Sign Language Recognition
- Authors: Muxin Pu, Mei Kuan Lim, Chun Yong Chong
- Abstract summary: Sign language recognition refers to interpreting sign language glosses from given videos automatically. Recent skeleton-based action recognition has attracted increasing attention due to its ability to handle variations in subjects and backgrounds independently.
- Score: 4.353165013945741
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sign language recognition (SLR) refers to automatically interpreting sign language glosses from given videos. This research area presents a complex challenge in computer vision because of the rapid and intricate movements inherent in sign languages, which encompass hand gestures, body postures, and even facial expressions. Recently, skeleton-based action recognition has attracted increasing attention due to its ability to handle variations in subjects and backgrounds independently. However, current skeleton-based SLR methods exhibit three limitations: 1) they often neglect the importance of realistic hand poses, with most studies training SLR models on non-realistic skeletal representations; 2) they tend to assume complete data availability in both the training and inference phases, and capture intricate relationships among different body parts collectively; 3) they treat all sign glosses uniformly, failing to account for differences in the complexity of their skeletal representations. To enhance the realism of hand skeletal representations, we present a kinematic hand pose rectification method for enforcing constraints. To mitigate the impact of missing data, we propose a feature-isolated mechanism that focuses on capturing local spatial-temporal context. This mechanism captures the context concurrently and independently for individual features, thus enhancing the robustness of the SLR model. Additionally, to adapt to the varying complexity levels of sign glosses, we develop an input-adaptive inference approach to optimise computational efficiency and accuracy. Experimental results demonstrate the effectiveness of our approach, as evidenced by a new state-of-the-art (SOTA) performance on WLASL100 and LSA64. On WLASL100, we achieve a top-1 accuracy of 86.50%, a relative improvement of 2.39% over the previous SOTA. On LSA64, we achieve a top-1 accuracy of 99.84%.
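To make the abstract's two architectural ideas concrete, here is a minimal sketch of how a feature-isolated encoder with input-adaptive early exit might look. This is an illustrative reconstruction, not the authors' implementation: the module names, keypoint dimensions, pooling strategy, and the softmax-threshold exit rule are all assumptions.

```python
# Minimal sketch of feature isolation + input-adaptive inference.
# All names, dimensions, and the early-exit rule are assumptions for
# illustration; this is not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureIsolatedSLR(nn.Module):
    def __init__(self, part_dims=None, d_model=128, n_heads=4,
                 n_layers=2, n_classes=100, exit_threshold=0.9):
        super().__init__()
        # Assumed keypoint channels per stream: 21 2-D joints per hand, 25 for body.
        part_dims = part_dims or {"left_hand": 42, "right_hand": 42, "body": 50}
        self.parts = list(part_dims)
        self.proj = nn.ModuleDict(
            {p: nn.Linear(d, d_model) for p, d in part_dims.items()})

        def encoder():
            layer = nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
            return nn.TransformerEncoder(layer, n_layers)

        # Isolated encoders: each stream attends only to its own frames, so
        # missing or noisy data in one body part cannot corrupt the others.
        self.encoders = nn.ModuleDict({p: encoder() for p in self.parts})
        self.fusion = encoder()  # attends across the pooled part tokens
        self.early_head = nn.Linear(d_model, n_classes)
        self.final_head = nn.Linear(d_model, n_classes)
        self.exit_threshold = exit_threshold

    def forward(self, streams):
        # streams: dict of (batch, frames, channels) keypoint tensors.
        tokens = []
        for p in self.parts:
            h = self.encoders[p](self.proj[p](streams[p]))  # (B, T, d_model)
            tokens.append(h.mean(dim=1))                    # temporal pooling
        x = torch.stack(tokens, dim=1)                      # (B, parts, d_model)
        early = self.early_head(x.mean(dim=1))
        # Input-adaptive inference: confident (easy) glosses exit before the
        # fusion stage. Checked per batch here for brevity; a real system
        # would decide per sample.
        if not self.training and F.softmax(early, dim=-1).max() >= self.exit_threshold:
            return early
        return self.final_head(self.fusion(x).mean(dim=1))

model = FeatureIsolatedSLR().eval()
clip = {"left_hand": torch.randn(1, 30, 42),
        "right_hand": torch.randn(1, 30, 42),
        "body": torch.randn(1, 30, 50)}
logits = model(clip)  # (1, 100) class scores over sign glosses
```

One way to read the robustness claim: because the streams never attend to each other before fusion, a missing hand stream could be zero-filled or replaced with a mask token without disturbing the body encoder's representation.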
Related papers
- SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models [80.90808879991182]
Drawing on systematic analysis, we uncover two fundamental limitations of foundation models for sketch understanding. We address these limitations by strategically combining SD with CLIP, whose strong semantic understanding naturally compensates for SD's spatial-frequency biases. By dynamically injecting CLIP features into SD's denoising process and adaptively aggregating features across semantic levels, our method achieves state-of-the-art performance in sketch retrieval.
arXiv Detail & Related papers (2025-03-18T10:47:46Z)
- Training Strategies for Isolated Sign Language Recognition [72.27323884094953]
This paper introduces a comprehensive model training pipeline for Isolated Sign Language Recognition. The constructed pipeline incorporates carefully selected image and video augmentations to tackle the challenges of low data quality and varying sign speeds. We achieve a state-of-the-art result on the WLASL and Slovo benchmarks with 1.63% and 14.12% improvements compared to the previous best solution.
arXiv Detail & Related papers (2024-12-16T08:37:58Z)
- MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
arXiv Detail & Related papers (2024-05-31T08:06:05Z)
- SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign Language Understanding [132.78015553111234]
Hand gestures play a crucial role in the expression of sign language.
Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resources.
We propose the first self-supervised pre-trainable SignBERT+ framework with model-aware hand prior incorporated.
arXiv Detail & Related papers (2023-05-08T17:16:38Z)
- Learning to Decompose Visual Features with Latent Textual Prompts [140.2117637223449]
We propose Decomposed Feature Prompting (DeFo) to improve vision-language models.
Our empirical study shows that DeFo significantly improves vision-language models.
arXiv Detail & Related papers (2022-10-09T15:40:13Z)
- Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z)
- Pose-based Sign Language Recognition using GCN and BERT [0.0]
Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language.
Recognizing signs from videos is a challenging task, as the meaning of a word depends on a combination of subtle body motions, hand configurations, and other movements.
Recent pose-based architectures for WSLR either model the spatial and temporal dependencies among the poses in different frames simultaneously, or model only the temporal information without fully utilizing the spatial information.
We tackle the problem of WSLR using a novel pose-based approach, which captures spatial and temporal information separately and performs late fusion.
arXiv Detail & Related papers (2020-12-01T19:10:50Z)
- Selective Spatio-Temporal Aggregation Based Pose Refinement System: Towards Understanding Human Activities in Real-World Videos [8.571131862820833]
State-of-the-art pose estimators struggle to obtain high-quality 2D or 3D pose data due to truncation and low resolution in real-world, un-annotated videos.
We propose a Selective Spatio-Temporal Aggregation mechanism, named SST-A, that refines and smooths the keypoint locations extracted by multiple expert pose estimators.
We demonstrate that the skeleton data refined by our Pose-Refinement system (SSTA-PRS) is effective at boosting various existing action recognition models.
arXiv Detail & Related papers (2020-11-10T19:19:51Z)
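The SST-A entry above describes refining keypoints produced by multiple expert pose estimators before feeding them to action models. As a rough illustration of the general idea only (confidence-weighted fusion plus temporal smoothing; the paper's actual selective aggregation mechanism differs), a sketch:

```python
# Generic sketch of multi-estimator keypoint refinement: confidence-weighted
# fusion across expert pose estimators, then temporal smoothing. This is an
# illustration of the general idea, not the paper's SST-A mechanism.
import numpy as np

def refine_keypoints(estimates, confidences, window=5):
    """estimates: (n_estimators, frames, joints, 2) pixel coordinates.
    confidences: (n_estimators, frames, joints) per-joint scores.
    window: odd temporal window for moving-average smoothing."""
    # Weight each estimator's prediction by its per-joint confidence.
    w = confidences / np.clip(confidences.sum(axis=0, keepdims=True), 1e-6, None)
    fused = (estimates * w[..., None]).sum(axis=0)          # (frames, joints, 2)
    # Moving-average over time suppresses jitter and single-frame outliers.
    pad = window // 2
    padded = np.pad(fused, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    frames, joints, _ = fused.shape
    smoothed = np.empty_like(fused)
    for j in range(joints):
        for c in range(2):
            smoothed[:, j, c] = np.convolve(padded[:, j, c], kernel, mode="valid")
    return smoothed

# Example: fuse two estimators' outputs for a 30-frame, 25-joint skeleton.
est = np.random.rand(2, 30, 25, 2)
conf = np.random.rand(2, 30, 25)
refined = refine_keypoints(est, conf)  # (30, 25, 2)
```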
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.