SignRep: Enhancing Self-Supervised Sign Representations
- URL: http://arxiv.org/abs/2503.08529v1
- Date: Tue, 11 Mar 2025 15:20:01 GMT
- Title: SignRep: Enhancing Self-Supervised Sign Representations
- Authors: Ryan Wong, Necati Cihan Camgoz, Richard Bowden
- Abstract summary: Sign language representation learning presents unique challenges due to the complex spatio-temporal nature of signs and the scarcity of labeled datasets. We introduce a scalable, self-supervised framework for sign representation learning. Our model does not require skeletal keypoints during inference, avoiding the limitations of keypoint-based models during downstream tasks. It excels in sign dictionary retrieval and sign translation, surpassing standard MAE pre-training and skeletal-based representations in retrieval.
- Score: 30.008980708977095
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Sign language representation learning presents unique challenges due to the complex spatio-temporal nature of signs and the scarcity of labeled datasets. Existing methods often rely either on models pre-trained on general visual tasks, which lack sign-specific features, or on complex multimodal and multi-branch architectures. To bridge this gap, we introduce a scalable, self-supervised framework for sign representation learning. We leverage important inductive (sign) priors during the training of our RGB model. To do this, we leverage simple but important cues based on skeletons while pretraining a masked autoencoder. These sign-specific priors, alongside feature regularization and an adversarial style-agnostic loss, provide a powerful backbone. Notably, our model does not require skeletal keypoints during inference, avoiding the limitations of keypoint-based models during downstream tasks. When finetuned, we achieve state-of-the-art performance for sign recognition on the WLASL, ASL-Citizen and NMFs-CSL datasets, using a simpler architecture and only a single modality. Beyond recognition, our frozen model excels in sign dictionary retrieval and sign translation, surpassing standard MAE pretraining and skeletal-based representations in retrieval. It also reduces computational costs for training existing sign translation models while maintaining strong performance on Phoenix2014T, CSL-Daily and How2Sign.
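The abstract describes the recipe only at a high level. The sketch below is a minimal PyTorch illustration of that recipe under stated assumptions: the encoder stand-in, module names, dimensions and loss weights are hypothetical and not the authors' released implementation; it only shows how a masked RGB encoder can be trained against skeleton-derived targets with a feature-regularization term and a gradient-reversal signer classifier as the adversarial style-agnostic loss.

```python
# Minimal sketch (PyTorch) of the pretraining recipe described above.
# All names, dimensions and loss weights are assumptions for illustration;
# this is not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, reversed gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output


class SignPriorPretrainer(nn.Module):
    def __init__(self, feat_dim=512, num_keypoints=75, num_signers=100):
        super().__init__()
        # Stand-in for a masked video encoder (e.g. a ViT over masked RGB tubelets).
        self.encoder = nn.Sequential(nn.Flatten(1), nn.LazyLinear(feat_dim), nn.GELU())
        # Head regressing skeleton-derived cues: the "sign prior" targets would come
        # from an off-the-shelf keypoint extractor and are used only during pretraining.
        self.keypoint_head = nn.Linear(feat_dim, num_keypoints * 2)
        # Signer classifier trained through gradient reversal, pushing the backbone
        # toward signer/style-agnostic features.
        self.signer_head = nn.Linear(feat_dim, num_signers)

    def forward(self, masked_frames):
        z = self.encoder(masked_frames)
        keypoints = self.keypoint_head(z)
        signer_logits = self.signer_head(GradReverse.apply(z))
        return z, keypoints, signer_logits


def pretrain_loss(model, masked_frames, target_keypoints, signer_ids,
                  lambda_reg=0.01, lambda_adv=0.1):
    z, kp_pred, signer_logits = model(masked_frames)
    # 1) Sign-prior reconstruction: predict skeletal cues rather than raw pixels.
    prior_loss = F.l1_loss(kp_pred, target_keypoints.flatten(1))
    # 2) Feature regularization to discourage degenerate representations.
    reg_loss = z.pow(2).mean()
    # 3) Adversarial style-agnostic loss (gradient reversal handles the sign flip).
    adv_loss = F.cross_entropy(signer_logits, signer_ids)
    return prior_loss + lambda_reg * reg_loss + lambda_adv * adv_loss
```

At inference only the encoder would be kept, which matches the abstract's point that skeletal keypoints are not required for downstream recognition, retrieval or translation.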
Related papers
- Uni-Sign: Toward Unified Sign Language Understanding at Scale [90.76641997060513]
We propose a unified pre-training framework that eliminates the gap between pre-training and downstream SLU tasks.
Uni-Sign achieves state-of-the-art performance across multiple downstream SLU tasks.
arXiv Detail & Related papers (2025-01-25T11:51:23Z)
- Improving Continuous Sign Language Recognition with Adapted Image Models [9.366498095041814]
Large-scale vision-language models (e.g., CLIP) have shown impressive generalization performance over a series of downstream tasks.
To adapt these large vision-language models to continuous sign language recognition with high efficiency, we propose a novel strategy (AdaptSign).
AdaptSign demonstrates superior performance compared to existing methods across a series of CSLR benchmarks, including PHOENIX14, PHOENIX14-T, CSL-Daily and CSL.
arXiv Detail & Related papers (2024-04-12T03:43:37Z)
- Improving Input-label Mapping with Demonstration Replay for In-context Learning [67.57288926736923]
In-context learning (ICL) is an emerging capability of large autoregressive language models.
We propose a novel ICL method called Sliding Causal Attention (RdSca).
We show that our method significantly improves the input-label mapping in ICL demonstrations.
arXiv Detail & Related papers (2023-10-30T14:29:41Z)
- Towards the extraction of robust sign embeddings for low resource sign language recognition [7.969704867355098]
We show that keypoint-based embeddings can transfer between sign languages and achieve competitive performance.
Furthermore, fine-tuned transferred embeddings achieve better performance than models trained only on the target sign language.
arXiv Detail & Related papers (2023-06-30T11:21:40Z)
- SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign Language Understanding [132.78015553111234]
Hand gestures play a crucial role in the expression of sign language.
Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resources.
We propose the first self-supervised pre-trainable SignBERT+ framework with model-aware hand prior incorporated.
arXiv Detail & Related papers (2023-05-08T17:16:38Z)
- Self-Sufficient Framework for Continuous Sign Language Recognition [75.60327502570242]
The goal of this work is to develop a self-sufficient framework for Continuous Sign Language Recognition.
The challenges addressed include the need for complex multi-scale features such as hands, face, and mouth, and the absence of frame-level annotations.
We propose Divide and Focus Convolution (DFConv) which extracts both manual and non-manual features without the need for additional networks or annotations.
Dense Pseudo-Label Refinement (DPLR) propagates non-spiky frame-level pseudo-labels by combining the ground-truth gloss sequence labels with the predicted sequence.
arXiv Detail & Related papers (2023-03-21T11:42:57Z)
- Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal [24.537234147678113]
We propose three auxiliary tasks to enhance the CSLR backbones.
A keypoint-guided spatial attention module is developed to make the visual module focus on informative regions.
A sentence embedding consistency constraint is imposed between the visual and sequential modules.
Our model achieves state-of-the-art or competitive performance on five benchmarks.
arXiv Detail & Related papers (2022-12-26T06:38:34Z)
- Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z)
- SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition [94.30084702921529]
Hand gestures play a critical role in sign language.
Current deep-learning-based sign language recognition methods may suffer from insufficient interpretability.
We introduce the first self-supervised pre-trainable SignBERT with incorporated hand prior for SLR.
arXiv Detail & Related papers (2021-10-11T16:18:09Z)
- Multi-Modal Zero-Shot Sign Language Recognition [51.07720650677784]
We propose a multi-modal Zero-Shot Sign Language Recognition model.
A Transformer-based model along with a C3D model is used for hand detection and deep feature extraction.
A semantic space is used to map the visual features to the lingual embeddings of the class labels (a sketch of this zero-shot mapping follows this entry).
arXiv Detail & Related papers (2021-09-02T09:10:39Z)
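As a rough illustration of the semantic-space mapping described in the last entry, the sketch below projects clip-level visual features into the space of class-label (word) embeddings and classifies by cosine similarity, which is what lets unseen classes be recognized zero-shot. The projection layer, dimensions and nearest-neighbour rule are assumptions for illustration, not the paper's exact model.

```python
# Hedged sketch of zero-shot classification via a learned visual-to-semantic
# projection; dimensions and the nearest-neighbour rule are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

visual_dim, text_dim = 1024, 300  # e.g. C3D/Transformer features vs. word vectors

# Learned mapping from visual features into the lingual (label-embedding) space.
project = nn.Linear(visual_dim, text_dim)

def zero_shot_predict(visual_feat, label_embeddings):
    """visual_feat: (B, visual_dim); label_embeddings: (num_classes, text_dim)."""
    v = F.normalize(project(visual_feat), dim=-1)
    t = F.normalize(label_embeddings, dim=-1)
    scores = v @ t.T              # cosine similarity to every class-label embedding
    return scores.argmax(dim=-1)  # nearest label embedding wins, even for unseen classes

# Usage with random stand-in tensors:
pred = zero_shot_predict(torch.randn(4, visual_dim), torch.randn(20, text_dim))
```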