Disentangle and Regularize: Sign Language Production with Articulator-Based Disentanglement and Channel-Aware Regularization
- URL: http://arxiv.org/abs/2504.06610v2
- Date: Fri, 20 Jun 2025 21:17:30 GMT
- Title: Disentangle and Regularize: Sign Language Production with Articulator-Based Disentanglement and Channel-Aware Regularization
- Authors: Sumeyye Meryem Tasyurek, Tugce Kiziltepe, Hacer Yalim Keles
- Abstract summary: We train a pose autoencoder that encodes sign poses into a compact latent space using an articulator-based disentanglement strategy. Next, a non-autoregressive transformer decoder is trained to predict these latent representations from sentence-level text embeddings. Our approach does not rely on gloss supervision or pretrained models, and achieves state-of-the-art results on the PHOENIX14T and CSL-Daily datasets.
- Score: 1.8024397171920885
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this work, we propose DARSLP, a simple gloss-free, transformer-based sign language production (SLP) framework that directly maps spoken-language text to sign pose sequences. We first train a pose autoencoder that encodes sign poses into a compact latent space using an articulator-based disentanglement strategy, where features corresponding to the face, right hand, left hand, and body are modeled separately to promote structured and interpretable representation learning. Next, a non-autoregressive transformer decoder is trained to predict these latent representations from sentence-level text embeddings. To guide this process, we apply channel-aware regularization by aligning predicted latent distributions with priors extracted from the ground-truth encodings using a KL-divergence loss. The contribution of each channel to the loss is weighted according to its associated articulator region, enabling the model to account for the relative importance of different articulators during training. Our approach does not rely on gloss supervision or pretrained models, and achieves state-of-the-art results on the PHOENIX14T and CSL-Daily datasets.
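To make the two-stage recipe concrete, below is a minimal PyTorch sketch of a first-stage articulator-split autoencoder and a channel-aware KL term. The keypoint slices, latent sizes, per-articulator weights, and the diagonal-Gaussian treatment of the latents are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

# Hypothetical articulator layout of a 150-dim pose vector; the paper's
# actual keypoint indexing is not specified here.
ARTICULATORS = {"face": slice(0, 60), "right_hand": slice(60, 90),
                "left_hand": slice(90, 120), "body": slice(120, 150)}

class ArticulatorAutoencoder(nn.Module):
    """Stage 1: each articulator group gets its own latent channel."""
    def __init__(self, latent_per_part=32):
        super().__init__()
        self.encoders = nn.ModuleDict({
            n: nn.Linear(sl.stop - sl.start, latent_per_part)
            for n, sl in ARTICULATORS.items()})
        self.decoders = nn.ModuleDict({
            n: nn.Linear(latent_per_part, sl.stop - sl.start)
            for n, sl in ARTICULATORS.items()})

    def encode(self, pose):  # pose: (B, T, 150)
        return {n: self.encoders[n](pose[..., sl])
                for n, sl in ARTICULATORS.items()}

    def forward(self, pose):
        z = self.encode(pose)
        recon = torch.cat([self.decoders[n](z[n]) for n in ARTICULATORS], dim=-1)
        return recon, z

def channel_aware_kl(pred_z, gt_z, weights, eps=1e-6):
    """Stage 2 regularizer: match each predicted channel's latent statistics
    to priors from the ground-truth encodings, weighting channels by their
    articulator region (diagonal-Gaussian KL is a simplifying assumption)."""
    loss = 0.0
    for name, w in weights.items():
        p_mu, p_var = pred_z[name].mean(1), pred_z[name].var(1) + eps
        g_mu, g_var = gt_z[name].mean(1), gt_z[name].var(1) + eps
        kl = 0.5 * (torch.log(g_var / p_var)
                    + (p_var + (p_mu - g_mu) ** 2) / g_var - 1)
        loss = loss + w * kl.mean()
    return loss

# Illustrative weighting that emphasizes the hands; the paper's values differ.
weights = {"face": 1.0, "right_hand": 2.0, "left_hand": 2.0, "body": 0.5}
```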
Related papers
- AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition [0.0]
Continuously recognizing sign gestures and converting them to glosses plays a key role in bridging the gap between hearing and hearing-impaired communities. We propose AutoSign, an autoregressive decoder-only transformer that directly translates pose sequences to natural language text. By eliminating the multi-stage pipeline, AutoSign achieves substantial improvements on the Isharah-1000 dataset.
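A decoder-only pose-to-text model of this kind can be sketched as follows; the pose dimension, vocabulary size, and the shared causal mask over pose and text positions are assumptions for illustration, not AutoSign's actual configuration.

```python
import torch
import torch.nn as nn

class PoseToTextDecoder(nn.Module):
    """Decoder-only sketch: pose frames are projected into the token embedding
    space, prefixed to the text tokens, and a causally masked Transformer
    predicts the transcript token by token."""
    def __init__(self, pose_dim=150, d_model=512, vocab=8000, layers=6):
        super().__init__()
        self.pose_proj = nn.Linear(pose_dim, d_model)
        self.tok_emb = nn.Embedding(vocab, d_model)
        block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, poses, tokens):  # poses: (B, Tp, pose_dim), tokens: (B, Tt)
        x = torch.cat([self.pose_proj(poses), self.tok_emb(tokens)], dim=1)
        L = x.size(1)
        # One causal mask over the whole sequence; a real system might allow
        # bidirectional attention within the pose prefix.
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.backbone(x, mask=causal)
        return self.lm_head(h[:, poses.size(1):])  # score only text positions
```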
arXiv Detail & Related papers (2025-07-26T07:28:33Z) - StgcDiff: Spatial-Temporal Graph Condition Diffusion for Sign Language Transition Generation [33.695308849489784]
We propose StgcDiff, a graph-based conditional diffusion framework that generates smooth transitions between discrete signs. Specifically, we train an encoder-decoder architecture to learn a structure-aware representation of the spatial-temporal skeleton. We design the Sign-GCN module as the key component in our framework, which effectively models spatial-temporal features.
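The conditioning idea can be illustrated with a generic epsilon-prediction diffusion step in which skeleton features from a structure-aware encoder (a stand-in for Sign-GCN) condition the denoiser; the shapes, MLP denoiser, and toy cosine schedule below are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Sketch of a diffusion denoiser conditioned on per-frame skeleton
    features; dimensions are assumptions."""
    def __init__(self, pose_dim=150, cond_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, pose_dim))

    def forward(self, noisy_pose, t, cond):
        # Broadcast the (normalized) timestep as an extra scalar per frame.
        t_feat = t.view(-1, 1, 1).expand(-1, noisy_pose.size(1), 1)
        return self.net(torch.cat([noisy_pose, cond, t_feat], dim=-1))

def diffusion_loss(model, pose, cond, T=1000):
    """Standard epsilon-prediction objective on transition frames."""
    t = torch.randint(0, T, (pose.size(0),))
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / T) ** 2  # toy schedule
    a = alpha_bar.view(-1, 1, 1)
    eps = torch.randn_like(pose)
    noisy = a.sqrt() * pose + (1 - a).sqrt() * eps
    return nn.functional.mse_loss(model(noisy, t.float() / T, cond), eps)
```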
arXiv Detail & Related papers (2025-06-16T07:09:51Z) - SignRep: Enhancing Self-Supervised Sign Representations [30.008980708977095]
Sign language representation learning presents unique challenges due to the complex spatio-temporal nature of signs and the scarcity of labeled datasets. We introduce a scalable, self-supervised framework for sign representation learning. Our model does not require skeletal keypoints during inference, avoiding the limitations of keypoint-based models in downstream tasks. It excels in sign dictionary retrieval and sign translation, surpassing standard MAE pre-training and skeletal-based representations in retrieval.
arXiv Detail & Related papers (2025-03-11T15:20:01Z) - Deep Understanding of Sign Language for Sign to Subtitle Alignment [13.96216152723074]
We leverage grammatical rules of British Sign Language to pre-process the input subtitles.
We design a selective alignment loss to optimise the model for predicting the temporal location of signs.
We conduct self-training with refined pseudo-labels which are more accurate than the audio-aligned labels.
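A generic version of confidence-filtered self-training with a selective per-frame loss might look like the sketch below; the sigmoid framing, threshold, and masking scheme are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def refine_pseudo_labels(frame_logits, threshold=0.9):
    """Keep only confident per-frame alignment predictions as pseudo-labels;
    the 0.9 threshold is an illustrative assumption."""
    probs = frame_logits.sigmoid()                       # (B, T)
    confident = (probs > threshold) | (probs < 1 - threshold)
    return (probs > 0.5).float(), confident

def selective_alignment_loss(frame_logits, pseudo, keep_mask):
    """Apply the loss only on frames retained by the confidence filter."""
    per_frame = F.binary_cross_entropy_with_logits(
        frame_logits, pseudo, reduction="none")
    keep = keep_mask.float()
    return (per_frame * keep).sum() / keep.sum().clamp(min=1.0)
```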
arXiv Detail & Related papers (2025-03-05T09:13:40Z) - SignAttention: On the Interpretability of Transformer Models for Sign Language Translation [2.079808290618441]
This paper presents the first comprehensive interpretability analysis of a Transformer-based Sign Language Translation model.
We examine the attention mechanisms within the model to understand how it processes and aligns visual input with sequential glosses.
This work contributes to a deeper understanding of SLT models, paving the way for the development of more transparent and reliable translation systems.
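As a minimal illustration of this kind of analysis, one can average a translation model's cross-attention maps over heads and layers to inspect frame-to-gloss alignment; the tensor layout below is an assumption.

```python
import torch

def average_cross_attention(attn_layers):
    """attn_layers: list of per-layer cross-attention tensors with assumed
    shape (B, heads, T_gloss, T_frames). Returns (B, T_gloss, T_frames):
    higher values suggest which frames each gloss position attends to."""
    per_layer = [a.mean(dim=1) for a in attn_layers]   # average over heads
    return torch.stack(per_layer).mean(dim=0)          # average over layers
```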
arXiv Detail & Related papers (2024-10-18T14:38:37Z) - Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
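The core trick can be sketched with a tiny cross-attention decoder whose `memory` is a frozen CLIP text embedding instead of image features during pre-training; the decoder below is a simplified stand-in for DPTR's actual recognizer.

```python
import torch
import torch.nn as nn

class TinySTRDecoder(nn.Module):
    """Minimal cross-attention text decoder. In DPTR-style pre-training the
    `memory` it attends to comes from a frozen CLIP text encoder (assumed
    shape (B, L, d_model)); fine-tuning later swaps in visual embeddings."""
    def __init__(self, d_model=512, vocab=100):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.emb = nn.Embedding(vocab, d_model)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, char_ids, memory):
        # A real autoregressive recognizer would also pass a causal tgt_mask.
        h = self.decoder(self.emb(char_ids), memory)
        return self.head(h)  # (B, T_chars, vocab) character logits
```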
arXiv Detail & Related papers (2024-08-11T06:36:42Z) - A Transformer Model for Boundary Detection in Continuous Sign Language [55.05986614979846]
The Transformer model is employed for both Isolated Sign Language Recognition and Continuous Sign Language Recognition.
The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched.
The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos.
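The post-processing step could be as simple as smoothing and thresholding per-frame boundary scores produced by the isolated-sign model; the window size and threshold below are assumptions.

```python
import torch
import torch.nn.functional as F

def boundaries_from_scores(scores, smooth=5, thresh=0.5):
    """scores: (T,) per-frame boundary confidence. Smooth with a moving
    average, threshold, and report rising edges as boundary onsets."""
    kernel = torch.ones(1, 1, smooth) / smooth
    s = F.conv1d(scores.view(1, 1, -1), kernel, padding=smooth // 2).view(-1)
    s = s[:scores.numel()]                 # trim padding overhang
    above = s > thresh
    edges = above[1:] & ~above[:-1]        # False -> True transitions
    return edges.nonzero().flatten() + 1   # frame indices of onsets
```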
arXiv Detail & Related papers (2024-02-22T17:25:01Z) - FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as input the one-hot encoded ID features of the tabular modality.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as input the sentences of the textual modality.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
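One way such feature-level alignment is commonly realized is a symmetric contrastive loss between per-field features of the two models; the sketch below uses InfoNCE over flattened fields, with shapes and temperature as assumptions rather than FLIP's exact objective.

```python
import torch
import torch.nn.functional as F

def fieldwise_alignment_loss(id_feats, text_feats, temperature=0.07):
    """id_feats, text_feats: (B, F, d) -- F fields per sample. Each field of
    the ID-based model is contrastively matched to its textual counterpart
    from the PLM, rather than aligning whole-sample vectors."""
    a = F.normalize(id_feats, dim=-1).flatten(0, 1)    # (B*F, d)
    b = F.normalize(text_feats, dim=-1).flatten(0, 1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```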
arXiv Detail & Related papers (2023-10-30T11:25:03Z) - SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign Language Understanding [132.78015553111234]
Hand gestures play a crucial role in the expression of sign language.
Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resources.
We propose the first self-supervised pre-trainable SignBERT+ framework with model-aware hand prior incorporated.
arXiv Detail & Related papers (2023-05-08T17:16:38Z) - BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization [135.73436686653315]
We are dedicated to leveraging the BERT pre-training success and modeling the domain-specific statistics to fertilize the sign language recognition (SLR) model.
Considering the dominance of hand and body in sign language expression, we organize them as pose triplet units and feed them into the Transformer backbone.
Pre-training is performed via reconstructing the masked triplet unit from the corrupted input sequence.
It adaptively extracts the discrete pseudo label from the pose triplet unit, which represents the semantic gesture/body state.
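The masking-and-reconstruction objective can be sketched as below; note the paper reconstructs discrete pseudo-labels from a learned tokenizer, whereas this simplification regresses the raw poses, and the unit layout and mask ratio are assumptions.

```python
import torch
import torch.nn.functional as F

def mask_triplet_units(triplets, mask_ratio=0.5, mask_value=0.0):
    """triplets: (B, T, 3, D) pose triplet units per frame, with the assumed
    order (right hand, left hand, body). Whole units are masked at random."""
    mask = torch.rand(triplets.shape[:2]) < mask_ratio   # (B, T)
    corrupted = triplets.clone()
    corrupted[mask] = mask_value
    return corrupted, mask

def reconstruction_loss(model, triplets):
    """BEST instead predicts discrete pseudo-labels of the masked units;
    plain regression is used here only to keep the sketch self-contained."""
    corrupted, mask = mask_triplet_units(triplets)
    pred = model(corrupted)                   # assumed same shape as input
    return F.mse_loss(pred[mask], triplets[mask])
```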
arXiv Detail & Related papers (2023-02-10T06:23:44Z) - Sketch-Guided Text-to-Image Diffusion Models [57.12095262189362]
We introduce a universal approach to guide a pretrained text-to-image diffusion model.
Our method does not require training a dedicated model or a specialized encoder for the task.
We take a particular focus on the sketch-to-image translation task, revealing a robust and expressive way to generate images.
arXiv Detail & Related papers (2022-11-24T18:45:32Z) - SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition [94.30084702921529]
Hand gestures play a critical role in sign language.
Current deep-learning-based sign language recognition methods may suffer from insufficient interpretability.
We introduce the first self-supervised pre-trainable SignBERT with incorporated hand prior for SLR.
arXiv Detail & Related papers (2021-10-11T16:18:09Z) - Pose-based Sign Language Recognition using GCN and BERT [0.0]
Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language.
Recognizing signs from videos is a challenging task, as the meaning of a word depends on a combination of subtle body motions, hand configurations, and other movements.
Recent pose-based architectures for WSLR either model both the spatial and temporal dependencies among the poses in different frames simultaneously, or model only the temporal information without fully utilizing the spatial information.
We tackle the problem of WSLR using a novel pose-based approach, which captures spatial and temporal information separately and performs late fusion, as sketched below.
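A separate-then-fuse design of this kind might look like the following, with a per-frame spatial branch (a linear stand-in where the paper uses a GCN), a temporal Transformer branch (standing in for BERT), and late fusion by averaging class scores; all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LateFusionWSLR(nn.Module):
    """Spatial and temporal information are modeled separately, then their
    class scores are fused late by averaging."""
    def __init__(self, joints=54, feat=2, d_model=256, classes=2000):
        super().__init__()
        self.spatial = nn.Linear(joints * feat, d_model)   # stand-in for a GCN
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        self.head_s = nn.Linear(d_model, classes)
        self.head_t = nn.Linear(d_model, classes)

    def forward(self, pose):                 # pose: (B, T, joints, feat)
        x = self.spatial(pose.flatten(2))    # (B, T, d_model)
        s_logits = self.head_s(x.mean(dim=1))                  # spatial branch
        t_logits = self.head_t(self.temporal(x).mean(dim=1))   # temporal branch
        return 0.5 * (s_logits + t_logits)   # late fusion by averaging
```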
arXiv Detail & Related papers (2020-12-01T19:10:50Z) - SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this feature of our model improves the performance of the original BERT by large margins.
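A stripped-down version of the unshuffling objective: permute a document's sentences and train a head to recover each sentence's original position. The linear position classifier and `max_sents` cap here are simplifying assumptions, not the paper's decoder.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderHead(nn.Module):
    """Classify each shuffled sentence's original position."""
    def __init__(self, d_model=768, max_sents=32):
        super().__init__()
        self.cls = nn.Linear(d_model, max_sents)

    def forward(self, sent_emb):   # (N, d_model) sentence embeddings
        return self.cls(sent_emb)  # (N, max_sents) position logits

def unshuffle_loss(head, sent_emb, order):
    """order[i] = original index of shuffled sentence i."""
    return F.cross_entropy(head(sent_emb), torch.tensor(order))

# Constructing a training example:
sents = ["First sentence.", "Second sentence.", "Third sentence."]
order = list(range(len(sents)))
random.shuffle(order)
shuffled = [sents[i] for i in order]  # encode these to get sent_emb
```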
arXiv Detail & Related papers (2020-10-30T13:33:41Z) - Transferring Cross-domain Knowledge for Video Sign Language Recognition [103.9216648495958]
Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation.
We propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled news sign to them.
arXiv Detail & Related papers (2020-03-08T03:05:21Z)