Text-Driven Diffusion Model for Sign Language Production
- URL: http://arxiv.org/abs/2503.15914v1
- Date: Thu, 20 Mar 2025 07:45:27 GMT
- Title: Text-Driven Diffusion Model for Sign Language Production
- Authors: Jiayi He, Xu Wang, Ruobei Zhang, Shengeng Tang, Yaxiong Wang, Lechao Cheng
- Abstract summary: We introduce the hfut-lmc team's solution to the SLRTP Sign Production Challenge. The challenge aims to generate semantically aligned sign language pose sequences from text inputs. Our solution achieves a BLEU-1 score of 20.17, placing second in the challenge.
- Score: 13.671593137551268
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the hfut-lmc team's solution to the SLRTP Sign Production Challenge. The challenge aims to generate semantically aligned sign language pose sequences from text inputs. To this end, we propose a Text-driven Diffusion Model (TDM) framework. During the training phase, TDM utilizes an encoder to encode text sequences and incorporates them into the diffusion model as conditional input to generate sign pose sequences. To guarantee the high quality and accuracy of the generated pose sequences, we utilize two key loss functions. The joint loss function L_{joint} is used to precisely measure and minimize the differences between the joint positions of the generated pose sequences and those of the ground truth. Similarly, the bone orientation loss function L_{bone} is instrumental in ensuring that the orientation of the bones in the generated poses aligns with the actual, correct orientations. In the inference stage, the TDM framework takes on a different yet equally important task. It starts with noisy sequences and, under the strict constraints of the text conditions, gradually refines and generates semantically consistent sign language pose sequences. Our carefully designed framework performs well on the sign language production task, and our solution achieves a BLEU-1 score of 20.17, placing second in the challenge.
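The two training losses described in the abstract can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the function names, the 2-D skeleton, and the exact formulations (mean squared error for joint positions, cosine distance for bone directions) are assumptions for clarity.

```python
import math

def joint_loss(pred, gt):
    """L_joint sketch: mean squared error between predicted and
    ground-truth joint positions for one frame.

    pred, gt: lists of (x, y) joint coordinates.
    """
    assert len(pred) == len(gt)
    total = sum((px - gx) ** 2 + (py - gy) ** 2
                for (px, py), (gx, gy) in zip(pred, gt))
    return total / len(pred)

def bone_loss(pred, gt, bones):
    """L_bone sketch: penalize misaligned bone directions with
    1 - cosine similarity, averaged over the skeleton.

    bones: list of (parent, child) joint-index pairs defining the skeleton.
    """
    def direction(joints, a, b):
        dx = joints[b][0] - joints[a][0]
        dy = joints[b][1] - joints[a][1]
        norm = math.hypot(dx, dy) or 1.0  # avoid division by zero
        return dx / norm, dy / norm

    total = 0.0
    for a, b in bones:
        pdx, pdy = direction(pred, a, b)
        gdx, gdy = direction(gt, a, b)
        total += 1.0 - (pdx * gdx + pdy * gdy)  # 0 when perfectly aligned
    return total / len(bones)
```

For identical poses both losses are zero; note that `bone_loss` depends only on limb orientation, so it stays zero under a global translation of the pose while `joint_loss` does not, which is why the two terms are complementary.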
Related papers
- Disentangle and Regularize: Sign Language Production with Articulator-Based Disentanglement and Channel-Aware Regularization [1.8024397171920885]
We train a pose autoencoder that encodes sign poses into a compact latent space using an articulator-based disentanglement strategy.
A non-autoregressive transformer decoder is trained to predict latent representations from sentence-level text embeddings.
arXiv Detail & Related papers (2025-04-09T06:14:19Z) - MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production [93.32354378820648]
We propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users.
A sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step.
Experiments on How2Sign and PHOENIX14T datasets demonstrate that our model achieves competitive performance in sign language production.
arXiv Detail & Related papers (2024-07-04T13:53:50Z) - Sign Stitching: A Novel Approach to Sign Language Production [35.35777909051466]
We propose using dictionary examples to create expressive sign language sequences.
We present a 7-step approach to effectively stitch the signs together.
We leverage the SignGAN model to map the output to a photo-realistic signer.
arXiv Detail & Related papers (2024-05-13T11:44:57Z) - Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z) - SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign Language Understanding [132.78015553111234]
Hand gestures play a crucial role in the expression of sign language.
Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resources.
We propose the first self-supervised pre-trainable SignBERT+ framework with model-aware hand prior incorporated.
arXiv Detail & Related papers (2023-05-08T17:16:38Z) - Ham2Pose: Animating Sign Language Notation into Pose Sequences [9.132706284440276]
Translating spoken languages into Sign languages is necessary for open communication between the hearing and hearing-impaired communities.
We propose the first method for animating a text written in HamNoSys, a lexical Sign language notation, into signed pose sequences.
arXiv Detail & Related papers (2022-11-24T13:59:32Z) - G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model [8.047896755805981]
The Sign Language Production task aims to automatically translate spoken languages into sign sequences.
We present a novel solution by converting the continuous pose space generation problem into a discrete sequence generation problem.
Our results show that our model outperforms state-of-the-art G2P models on the public SLP evaluation benchmark.
arXiv Detail & Related papers (2022-08-19T03:49:13Z) - Don't Take It Literally: An Edit-Invariant Sequence Loss for Text Generation [109.46348908829697]
We propose a novel Edit-Invariant Sequence Loss (EISL), which computes the matching loss of a target n-gram with all n-grams in the generated sequence.
We conduct experiments on three tasks: machine translation with noisy target sequences, unsupervised text style transfer, and non-autoregressive machine translation.
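The position-invariant matching idea behind EISL can be illustrated with a toy, non-differentiable sketch: score each target n-gram against every n-gram position in the generated sequence. The real EISL operates on token probability distributions and is differentiable; the function below is only a hard-match analogue and is entirely hypothetical.

```python
def eisl_toy(target, generated, n=2):
    """Toy illustration of edit-invariant n-gram matching: the fraction
    of target n-grams that appear ANYWHERE in the generated sequence,
    turned into a loss (1 - match rate). Position shifts in the
    generated sequence therefore do not hurt the score."""
    tgt_ngrams = [tuple(target[i:i + n]) for i in range(len(target) - n + 1)]
    gen_ngrams = {tuple(generated[i:i + n]) for i in range(len(generated) - n + 1)}
    if not tgt_ngrams:
        return 0.0
    matched = sum(1 for g in tgt_ngrams if g in gen_ngrams)
    return 1.0 - matched / len(tgt_ngrams)
```

For example, a generated sequence that contains the target with an extra leading token still achieves zero loss, whereas a token-wise cross-entropy would penalize every shifted position.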
arXiv Detail & Related papers (2021-06-29T03:59:21Z) - COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining [59.169836983883656]
COCO-LM is a new self-supervised learning framework that pretrains Language Models by COrrecting challenging errors and COntrasting text sequences.
COCO-LM employs an auxiliary language model to mask-and-predict tokens in original text sequences.
Our analyses reveal that COCO-LM's advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.
arXiv Detail & Related papers (2021-02-16T22:24:29Z) - Rethinking Positional Encoding in Language Pre-training [111.2320727291926]
We show that in absolute positional encoding, the addition operation applied on positional embeddings and word embeddings brings mixed correlations.
We propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE).
arXiv Detail & Related papers (2020-06-28T13:11:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.