Data-Efficient American Sign Language Recognition via Few-Shot Prototypical Networks
- URL: http://arxiv.org/abs/2512.10562v1
- Date: Thu, 11 Dec 2025 11:50:03 GMT
- Title: Data-Efficient American Sign Language Recognition via Few-Shot Prototypical Networks
- Authors: Meher Md Saad,
- Abstract summary: Isolated Sign Language Recognition is critical for bridging the communication gap between the Deaf and Hard-of-Hearing (DHH) community and the hearing world.<n>We propose a Few-Shot Prototypical Network framework adapted for a skeleton based encoder.<n>Our approach utilizes episodic training to learn a semantic metric space where signs are classified based on their proximity to dynamic class prototypes.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Isolated Sign Language Recognition (ISLR) is critical for bridging the communication gap between the Deaf and Hard-of-Hearing (DHH) community and the hearing world. However, robust ISLR is fundamentally constrained by data scarcity and the long-tail distribution of sign vocabulary, where gathering sufficient examples for thousands of unique signs is prohibitively expensive. Standard classification approaches struggle under these conditions, often overfitting to frequent classes while failing to generalize to rare ones. To address this bottleneck, we propose a Few-Shot Prototypical Network framework adapted for a skeleton based encoder. Unlike traditional classifiers that learn fixed decision boundaries, our approach utilizes episodic training to learn a semantic metric space where signs are classified based on their proximity to dynamic class prototypes. We integrate a Spatiotemporal Graph Convolutional Network (ST-GCN) with a novel Multi-Scale Temporal Aggregation (MSTA) module to capture both rapid and fluid motion dynamics. Experimental results on the WLASL dataset demonstrate the superiority of this metric learning paradigm: our model achieves 43.75% Top-1 and 77.10% Top-5 accuracy on the test set. Crucially, this outperforms a standard classification baseline sharing the identical backbone architecture by over 13%, proving that the prototypical training strategy effectively outperforms in a data scarce situation where standard classification fails. Furthermore, the model exhibits strong zero-shot generalization, achieving nearly 30% accuracy on the unseen SignASL dataset without fine-tuning, offering a scalable pathway for recognizing extensive sign vocabularies with limited data.
Related papers
- Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech [51.14752758616364]
Speech-based depression detection (SDD) is a promising, non-invasive alternative to traditional clinical assessments.<n>We propose HAREN-CTC, a novel architecture that integrates multi-layer SSL features using cross-attention within a multitask learning framework.<n>The model achieves state-of-the-art macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming prior methods across both evaluation scenarios.
arXiv Detail & Related papers (2025-10-05T09:32:12Z) - Stack Transformer Based Spatial-Temporal Attention Model for Dynamic Sign Language and Fingerspelling Recognition [1.949837893170278]
Hand gesture-based Sign Language Recognition serves as a crucial bridge between deaf and non-deaf individuals.<n>We propose the Sequential Spatio-Temporal Attention Network (SSTAN), a novel Transformer-based architecture.<n>We validated our model through extensive experiments on diverse, large-scale datasets.
arXiv Detail & Related papers (2025-03-21T04:57:18Z) - SignRep: Enhancing Self-Supervised Sign Representations [30.008980708977095]
Sign language representation learning presents unique challenges due to the complex-temporal nature of signs and the scarcity of labeled datasets.<n>We introduce a scalable, self-supervised framework for sign representation learning.<n>Our model does not require skeletal keypoints during inference, avoiding the limitations of key-point-based models during downstream tasks.<n>It excels in sign dictionary retrieval and sign translation, surpassing standard MAE pre-training and skeletal-based representations in retrieval.
arXiv Detail & Related papers (2025-03-11T15:20:01Z) - CGMatch: A Different Perspective of Semi-supervised Learning [20.03126368452921]
Semi-supervised learning (SSL) has garnered significant attention due to its ability to leverage limited labeled data.<n>We argue that existing methods rely solely on the model's confidence to accurately assess the model's state.<n>We propose a novel SSL model called CGMatch, which, for the first time, incorporates a new metric known as Count-Gap.
arXiv Detail & Related papers (2025-03-04T03:14:15Z) - Self-supervised Learning for Acoustic Few-Shot Classification [10.180992026994739]
We introduce and evaluate a new architecture that combines CNN-based preprocessing with feature extraction based on state space models (SSMs)<n>We pre-train this architecture using contrastive learning on the actual task data and subsequent fine-tuning with an extremely small amount of labelled data.<n>Our evaluation shows that it outperforms state-of-the-art architectures on the few-shot classification problem.
arXiv Detail & Related papers (2024-09-15T07:45:11Z) - Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z) - A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification [51.35500308126506]
Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels.
We study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types.
arXiv Detail & Related papers (2024-07-16T23:17:36Z) - Towards Realistic Zero-Shot Classification via Self Structural Semantic
Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z) - Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition ( SLR)
Our proposed SAM- SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z) - No Fear of Heterogeneity: Classifier Calibration for Federated Learning
with Non-IID Data [78.69828864672978]
A central challenge in training classification models in the real-world federated system is learning with non-IID data.
We propose a novel and simple algorithm called Virtual Representations (CCVR), which adjusts the classifier using virtual representations sampled from an approximated ssian mixture model.
Experimental results demonstrate that CCVR state-of-the-art performance on popular federated learning benchmarks including CIFAR-10, CIFAR-100, and CINIC-10.
arXiv Detail & Related papers (2021-06-09T12:02:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.