SignDiff: Learning Diffusion Models for American Sign Language
Production
- URL: http://arxiv.org/abs/2308.16082v1
- Date: Wed, 30 Aug 2023 15:14:56 GMT
- Title: SignDiff: Learning Diffusion Models for American Sign Language
Production
- Authors: Sen Fang, Chunyu Sui, Xuedong Zhang, Yapeng Tian
- Abstract summary: The field of Sign Language Production lacked a large-scale, pre-trained model based on deep learning for continuous American Sign Language (ASL) production in the past decade.
We propose SignDiff, a dual-condition diffusion pre-training model that can generate human sign language speakers from a skeleton pose.
Our ASLP method proposes two new improved modules and a new loss function to improve the accuracy and quality of sign language skeletal posture.
- Score: 27.899654531461238
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The field of Sign Language Production (SLP) lacked a large-scale, pre-trained
model based on deep learning for continuous American Sign Language (ASL)
production in the past decade. This limitation hampers communication for all
individuals with disabilities relying on ASL. To address this issue, we
undertook the secondary development and utilization of How2Sign, one of the
largest publicly available ASL datasets. Despite its significance, prior
researchers in the field of sign language have not effectively employed this
corpus due to the intricacies involved in American Sign Language Production
(ASLP).
To conduct large-scale ASLP, we propose SignDiff based on the latest work in
related fields, which is a dual-condition diffusion pre-training model that can
generate human sign language speakers from a skeleton pose. SignDiff has a
novel Frame Reinforcement Network called FR-Net, similar to dense human pose
estimation work, which enhances the correspondence between text lexical symbols
and sign language dense pose frames reduce the occurrence of multiple fingers
in the diffusion model. In addition, our ASLP method proposes two new improved
modules and a new loss function to improve the accuracy and quality of sign
language skeletal posture and enhance the ability of the model to train on
large-scale data.
We propose the first baseline for ASL production and report the scores of
17.19 and 12.85 on BLEU-4 on the How2Sign dev/test sets. We also evaluated our
model on the previous mainstream dataset called PHOENIX14T, and the main
experiments achieved the results of SOTA. In addition, our image quality far
exceeds all previous results by 10 percentage points on the SSIM indicator.
Finally, we conducted ablation studies and qualitative evaluations for
discussion.
Related papers
- Hierarchical Windowed Graph Attention Network and a Large Scale Dataset for Isolated Indian Sign Language Recognition [0.20075899678041528]
We propose a large-scale isolated ISL dataset and a novel SL recognition model based on skeleton graph structure.
The dataset covers 2,002 daily used common words in the deaf community recorded by 20 (10 male and 10 female) deaf adult signers.
arXiv Detail & Related papers (2024-07-19T11:48:36Z) - SignSpeak: Open-Source Time Series Classification for ASL Translation [0.12499537119440243]
We propose a low-cost, real-time ASL-to-speech translation glove and an exhaustive training dataset of sign language patterns.
We benchmarked this dataset with supervised learning models, such as LSTMs, GRUs and Transformers, where our best model achieved 92% accuracy.
Our open-source dataset, models and glove designs provide an accurate and efficient ASL translator while maintaining cost-effectiveness.
arXiv Detail & Related papers (2024-06-27T17:58:54Z) - SignLLM: Sign Languages Production Large Language Models [33.438444361552854]
We introduce the first comprehensive multilingual sign language dataset named Prompt2Sign.
Our dataset transforms a vast array of videos into a streamlined, model-friendly format.
We propose SignLLM, the first multilingual Sign Language Production model.
arXiv Detail & Related papers (2024-05-17T12:01:43Z) - Neural Sign Actors: A diffusion model for 3D sign language production from text [51.81647203840081]
Sign Languages (SL) serve as the primary mode of communication for the Deaf and Hard of Hearing communities.
This work makes an important step towards realistic neural sign avatars, bridging the communication gap between Deaf and hearing communities.
arXiv Detail & Related papers (2023-12-05T12:04:34Z) - SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign
Language Understanding [132.78015553111234]
Hand gesture serves as a crucial role during the expression of sign language.
Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resource.
We propose the first self-supervised pre-trainable SignBERT+ framework with model-aware hand prior incorporated.
arXiv Detail & Related papers (2023-05-08T17:16:38Z) - Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of
Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named recognition (NER) and apply it to solve a low-resource and real-world challenge of code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke.
arXiv Detail & Related papers (2022-04-10T21:46:52Z) - Signing at Scale: Learning to Co-Articulate Signs for Large-Scale
Photo-Realistic Sign Language Production [43.45785951443149]
Sign languages are visual languages, with vocabularies as rich as their spoken language counterparts.
Current deep-learning based Sign Language Production (SLP) models produce under-articulated skeleton pose sequences.
We tackle large-scale SLP by learning to co-articulate between dictionary signs.
We also propose SignGAN, a pose-conditioned human synthesis model that produces photo-realistic sign language videos.
arXiv Detail & Related papers (2022-03-29T08:51:38Z) - A Simple Multi-Modality Transfer Learning Baseline for Sign Language
Translation [54.29679610921429]
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts.
Data is thus a bottleneck for training effective sign language translation models.
This simple baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z) - BSL-1K: Scaling up co-articulated sign language recognition using
mouthing cues [106.21067543021887]
We show how to use mouthing cues from signers to obtain high-quality annotations from video data.
The BSL-1K dataset is a collection of British Sign Language (BSL) signs of unprecedented scale.
arXiv Detail & Related papers (2020-07-23T16:59:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.