SignDiff: Diffusion Models for American Sign Language Production
- URL: http://arxiv.org/abs/2308.16082v2
- Date: Sat, 19 Oct 2024 21:18:44 GMT
- Title: SignDiff: Diffusion Models for American Sign Language Production
- Authors: Sen Fang, Chunyu Sui, Yanghao Zhou, Xuedong Zhang, Hongbin Zhong, Minyu Zhao, Yapeng Tian, Chen Chen,
- Abstract summary: We propose a dual-condition diffusion pre-training model named SignDiff that can generate human sign language speakers from a skeleton pose.
We also propose a new method for American Sign Language Production (ASLP), which can generate ASL skeletal pose videos from text input.
- Score: 23.82668888574089
- License:
- Abstract: In this paper, we propose a dual-condition diffusion pre-training model named SignDiff that can generate human sign language speakers from a skeleton pose. SignDiff has a novel Frame Reinforcement Network called FR-Net, similar to dense human pose estimation work, which enhances the correspondence between text lexical symbols and sign language dense pose frames, reduces the occurrence of multiple fingers in the diffusion model. In addition, we propose a new method for American Sign Language Production (ASLP), which can generate ASL skeletal pose videos from text input, integrating two new improved modules and a new loss function to improve the accuracy and quality of sign language skeletal posture and enhance the ability of the model to train on large-scale data. We propose the first baseline for ASL production and report the scores of 17.19 and 12.85 on BLEU-4 on the How2Sign dev/test sets. We evaluated our model on the previous mainstream dataset PHOENIX14T, and our method achieved the SOTA results. In addition, our image quality far exceeds all previous results by 10 percentage points in terms of SSIM.
Related papers
- KD-MSLRT: Lightweight Sign Language Recognition Model Based on Mediapipe and 3D to 1D Knowledge Distillation [8.891724904033582]
We propose a cross-modal multi-knowledge distillation technique from 3D to 1D and a novel end-to-end pre-training text correction framework.
Our model achieves a decrease in Word Error Rate (WER) of at least 1.4% on PHOENIX14 and PHOENIX14T datasets compared to the state-of-the-art CorrNet.
We have also collected and released extensive Chinese sign language datasets, and developed a specialized training vocabulary.
arXiv Detail & Related papers (2025-01-04T15:59:33Z) - Signs as Tokens: An Autoregressive Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs.
To align sign language with the LM, we develop a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts.
These sign tokens are integrated into the raw text vocabulary of the LM, allowing for supervised fine-tuning on sign language datasets.
arXiv Detail & Related papers (2024-11-26T18:28:09Z) - T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text [59.57676466961787]
We propose a novel dynamic vector quantization (DVA-VAE) model that can adjust the encoding length based on the information density in sign language.
Experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method.
We propose a new large German sign language dataset, PHOENIX-News, which contains 486 hours of sign language videos, audio, and transcription texts.
arXiv Detail & Related papers (2024-06-11T10:06:53Z) - SignLLM: Sign Language Production Large Language Models [31.036549195000667]
We propose SignLLM, a multilingual Sign Language Production (SLP) large language model.
It includes two novel multilingual SLP modes MLSF and Prompt2LangGloss that allow sign language gestures generation.
For SignLLM's training, we introduce Prompt2Sign, a comprehensive multilingual sign language dataset.
arXiv Detail & Related papers (2024-05-17T12:01:43Z) - Sign Language Production with Latent Motion Transformer [2.184775414778289]
We develop a new method to make high-quality sign videos without using human poses as a middle step.
Our model works in two main parts: first, it learns from a generator and the video's hidden features, and next, it uses another model to understand the order of these hidden features.
Compared with previous state-of-the-art approaches, our model performs consistently better on two word-level sign language datasets.
arXiv Detail & Related papers (2023-12-20T10:53:06Z) - SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign
Language Understanding [132.78015553111234]
Hand gesture serves as a crucial role during the expression of sign language.
Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resource.
We propose the first self-supervised pre-trainable SignBERT+ framework with model-aware hand prior incorporated.
arXiv Detail & Related papers (2023-05-08T17:16:38Z) - DiffusionBERT: Improving Generative Masked Language Models with
Diffusion Models [81.84866217721361]
DiffusionBERT is a new generative masked language model based on discrete diffusion models.
We propose a new noise schedule for the forward diffusion process that controls the degree of noise added at each step.
Experiments on unconditional text generation demonstrate that DiffusionBERT achieves significant improvement over existing diffusion models for text.
arXiv Detail & Related papers (2022-11-28T03:25:49Z) - Shifted Diffusion for Text-to-image Generation [65.53758187995744]
Corgi is based on our proposed shifted diffusion model, which achieves better image embedding generation from input text.
Corgi also achieves new state-of-the-art results across different datasets on downstream language-free text-to-image generation tasks.
arXiv Detail & Related papers (2022-11-24T03:25:04Z) - Signing at Scale: Learning to Co-Articulate Signs for Large-Scale
Photo-Realistic Sign Language Production [43.45785951443149]
Sign languages are visual languages, with vocabularies as rich as their spoken language counterparts.
Current deep-learning based Sign Language Production (SLP) models produce under-articulated skeleton pose sequences.
We tackle large-scale SLP by learning to co-articulate between dictionary signs.
We also propose SignGAN, a pose-conditioned human synthesis model that produces photo-realistic sign language videos.
arXiv Detail & Related papers (2022-03-29T08:51:38Z) - Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese
Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model can bring an average increase of 1.5% under the 12-layer setting.
arXiv Detail & Related papers (2021-04-15T02:36:49Z) - Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign
Language Video [43.45785951443149]
To be truly understandable by Deaf communities, an automatic Sign Language Production system must generate a photo-realistic signer.
We propose SignGAN, the first SLP model to produce photo-realistic continuous sign language videos directly from spoken language.
A pose-conditioned human synthesis model is then introduced to generate a photo-realistic sign language video from the skeletal pose sequence.
arXiv Detail & Related papers (2020-11-19T14:31:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.