Disentangle and Regularize: Sign Language Production with Articulator-Based Disentanglement and Channel-Aware Regularization
- URL: http://arxiv.org/abs/2504.06610v1
- Date: Wed, 09 Apr 2025 06:14:19 GMT
- Title: Disentangle and Regularize: Sign Language Production with Articulator-Based Disentanglement and Channel-Aware Regularization
- Authors: Sumeyye Meryem Tasyurek, Tugce Kiziltepe, Hacer Yalim Keles
- Abstract summary: We train a pose autoencoder that encodes sign poses into a compact latent space using an articulator-based disentanglement strategy. A non-autoregressive transformer decoder is then trained to predict these latent representations from sentence-level text embeddings.
- Score: 1.8024397171920885
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this work, we propose a simple gloss-free, transformer-based sign language production (SLP) framework that directly maps spoken-language text to sign pose sequences. We first train a pose autoencoder that encodes sign poses into a compact latent space using an articulator-based disentanglement strategy, where features corresponding to the face, right hand, left hand, and body are modeled separately to promote structured and interpretable representation learning. Next, a non-autoregressive transformer decoder is trained to predict these latent representations from sentence-level text embeddings. To guide this process, we apply channel-aware regularization by aligning predicted latent distributions with priors extracted from the ground-truth encodings using a KL-divergence loss. The contribution of each channel to the loss is weighted according to its associated articulator region, enabling the model to account for the relative importance of different articulators during training. Our approach does not rely on gloss supervision or pretrained models, and achieves state-of-the-art results on the PHOENIX14T dataset using only a modest training set.
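The channel-aware regularization described above can be sketched as a weighted sum of per-channel KL terms. The following is a minimal illustration, assuming diagonal Gaussian latent statistics per channel; the articulator weight values, function names, and data layout are illustrative assumptions, not details taken from the paper.

```python
import math

# Hypothetical per-articulator weights; the paper weights each channel's KL
# contribution by its articulator region, but these values are illustrative.
ARTICULATOR_WEIGHTS = {"face": 1.0, "right_hand": 2.0, "left_hand": 2.0, "body": 0.5}

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for scalar Gaussians."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)

def channel_aware_kl(pred_stats, prior_stats, channel_regions):
    """Articulator-weighted KL between predicted and prior latent channels.

    pred_stats / prior_stats: one (mean, variance) pair per latent channel,
    where the priors come from the ground-truth pose encodings.
    channel_regions: the articulator region assigned to each channel.
    """
    total = 0.0
    for (mu_p, var_p), (mu_q, var_q), region in zip(pred_stats, prior_stats,
                                                    channel_regions):
        total += ARTICULATOR_WEIGHTS[region] * gaussian_kl(mu_p, var_p, mu_q, var_q)
    return total
```

With matching predicted and prior statistics the penalty vanishes; a mismatch on a hand channel is penalized more heavily than the same mismatch on a body channel, which is the intended effect of the per-articulator weighting.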
Related papers
- SignRep: Enhancing Self-Supervised Sign Representations [30.008980708977095]
Sign language representation learning presents unique challenges due to the complex spatio-temporal nature of signs and the scarcity of labeled datasets. We introduce a scalable, self-supervised framework for sign representation learning. Our model does not require skeletal keypoints during inference, avoiding the limitations of keypoint-based models in downstream tasks. It excels in sign dictionary retrieval and sign translation, surpassing standard MAE pre-training and skeletal-based representations in retrieval.
arXiv Detail & Related papers (2025-03-11T15:20:01Z)
- Deep Understanding of Sign Language for Sign to Subtitle Alignment [13.96216152723074]
We leverage grammatical rules of British Sign Language to pre-process the input subtitles.
We design a selective alignment loss to optimise the model for predicting the temporal location of signs.
We conduct self-training with refined pseudo-labels which are more accurate than the audio-aligned labels.
arXiv Detail & Related papers (2025-03-05T09:13:40Z)
- SignAttention: On the Interpretability of Transformer Models for Sign Language Translation [2.079808290618441]
This paper presents the first comprehensive interpretability analysis of a Transformer-based Sign Language Translation model.
We examine the attention mechanisms within the model to understand how it processes and aligns visual input with sequential glosses.
This work contributes to a deeper understanding of SLT models, paving the way for the development of more transparent and reliable translation systems.
arXiv Detail & Related papers (2024-10-18T14:38:37Z)
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR)
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- A Transformer Model for Boundary Detection in Continuous Sign Language [55.05986614979846]
The Transformer model is employed for both Isolated Sign Language Recognition and Continuous Sign Language Recognition.
The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched.
The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos.
arXiv Detail & Related papers (2024-02-22T17:25:01Z)
- SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign Language Understanding [132.78015553111234]
Hand gestures play a crucial role in the expression of sign language.
Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resources.
We propose the first self-supervised pre-trainable SignBERT+ framework with model-aware hand prior incorporated.
arXiv Detail & Related papers (2023-05-08T17:16:38Z)
- BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization [135.73436686653315]
We are dedicated to leveraging the success of BERT pre-training and modeling domain-specific statistics to fertilize the sign language recognition (SLR) model.
Considering the dominance of hand and body in sign language expression, we organize them as pose triplet units and feed them into the Transformer backbone.
Pre-training is performed via reconstructing the masked triplet unit from the corrupted input sequence.
It adaptively extracts the discrete pseudo label from the pose triplet unit, which represents the semantic gesture/body state.
arXiv Detail & Related papers (2023-02-10T06:23:44Z)
- Sketch-Guided Text-to-Image Diffusion Models [57.12095262189362]
We introduce a universal approach to guide a pretrained text-to-image diffusion model.
Our method does not require training a dedicated model or a specialized encoder for the task.
We take a particular focus on the sketch-to-image translation task, revealing a robust and expressive way to generate images.
arXiv Detail & Related papers (2022-11-24T18:45:32Z)
- SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this feature of our model improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.