Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video
- URL: http://arxiv.org/abs/2011.09846v4
- Date: Thu, 26 Nov 2020 19:00:34 GMT
- Title: Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video
- Authors: Ben Saunders, Necati Cihan Camgoz, Richard Bowden
- Abstract summary: To be truly understandable by Deaf communities, an automatic Sign Language Production system must generate a photo-realistic signer.
We propose SignGAN, the first SLP model to produce photo-realistic continuous sign language videos directly from spoken language.
A pose-conditioned human synthesis model is then introduced to generate a photo-realistic sign language video from the skeletal pose sequence.
- Score: 43.45785951443149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To be truly understandable and accepted by Deaf communities, an automatic
Sign Language Production (SLP) system must generate a photo-realistic signer.
Prior approaches based on graphical avatars have proven unpopular, whereas
recent neural SLP works that produce skeleton pose sequences have been shown
not to be understandable to Deaf viewers.
In this paper, we propose SignGAN, the first SLP model to produce
photo-realistic continuous sign language videos directly from spoken language.
We employ a transformer architecture with a Mixture Density Network (MDN)
formulation to handle the translation from spoken language to skeletal pose. A
pose-conditioned human synthesis model is then introduced to generate a
photo-realistic sign language video from the skeletal pose sequence. This
allows the photo-realistic production of sign videos directly translated from
written text.
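To see why an MDN matters here: translation from spoken language to pose is one-to-many (a single sentence admits several valid signing sequences), and a plain regression head trained with MSE would average those modes into under-articulated motion, whereas a mixture output preserves them. Below is a minimal sketch, assuming PyTorch, of a mixture-density head over transformer decoder states; all names, layer sizes, and the five-component mixture are illustrative assumptions, not the authors' code.

```python
# A minimal sketch, assuming PyTorch, of a Mixture Density Network (MDN)
# head over transformer decoder states. All names, sizes, and the choice
# of 5 mixture components are illustrative assumptions, not SignGAN code.
import torch
import torch.nn as nn

class MDNPoseHead(nn.Module):
    """Maps decoder states to a Gaussian mixture over skeletal poses."""

    def __init__(self, d_model=512, pose_dim=150, n_mix=5):
        super().__init__()
        self.n_mix, self.pose_dim = n_mix, pose_dim
        self.pi = nn.Linear(d_model, n_mix)                    # mixture weights
        self.mu = nn.Linear(d_model, n_mix * pose_dim)         # component means
        self.log_sigma = nn.Linear(d_model, n_mix * pose_dim)  # log std-devs

    def forward(self, h):
        # h: (batch, time, d_model) transformer decoder outputs
        B, T, _ = h.shape
        log_pi = torch.log_softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(B, T, self.n_mix, self.pose_dim)
        sigma = self.log_sigma(h).view(B, T, self.n_mix, self.pose_dim).exp()
        return log_pi, mu, sigma

def mdn_nll(log_pi, mu, sigma, target):
    """Negative log-likelihood of ground-truth poses under the mixture."""
    # target: (batch, time, pose_dim), broadcast against each component
    comp = torch.distributions.Normal(mu, sigma)
    log_prob = comp.log_prob(target.unsqueeze(2)).sum(dim=-1)  # (B, T, n_mix)
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```

At inference time one would sample from the mixture, or take the mean of the most probable component, to produce each pose frame in turn.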
We further propose a novel keypoint-based loss function, which significantly
improves the quality of synthesized hand images, operating in the keypoint
space to avoid issues caused by motion blur. In addition, we introduce a method
for controllable video generation, enabling training on large, diverse sign
language datasets and providing the ability to control the signer's appearance at
inference.
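The keypoint-space idea can be illustrated as follows: a frozen, pre-trained hand keypoint regressor is run on both the generated and ground-truth hand crops, and the loss compares joint positions rather than pixels, so a motion-blurred but correctly articulated hand is not penalized. This is a minimal sketch assuming PyTorch; `HandKeypointNet` is a hypothetical stand-in for whatever off-the-shelf hand pose estimator one would actually plug in.

```python
# A minimal sketch, assuming PyTorch, of a hand loss computed in keypoint
# space. `HandKeypointNet` is a hypothetical stand-in for a frozen,
# pre-trained hand keypoint regressor (21 2-D joints here), not the
# authors' network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HandKeypointNet(nn.Module):
    """Placeholder regressor: (B, 3, 64, 64) hand crop -> (B, 21, 2) joints."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 42),
        )

    def forward(self, crop):
        return self.net(crop).view(-1, 21, 2)

def hand_keypoint_loss(keypoint_net, fake_crop, real_crop):
    """L1 distance between keypoints of generated and real hand crops.
    Comparing joints instead of pixels sidesteps motion blur: a blurred
    but correctly posed hand still yields matching keypoints."""
    with torch.no_grad():              # no gradient through the real branch
        target = keypoint_net(real_crop)
    pred = keypoint_net(fake_crop)     # gradients flow back to the generator
    return F.l1_loss(pred, target)
```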
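Similarly, one common way to realize controllable signer appearance is to condition the pose-to-video generator on a learned per-signer style code alongside the pose input. The sketch below is again a PyTorch assumption rather than the SignGAN architecture; it uses one embedding per interpreter (eight, mirroring the dataset described below) broadcast over the spatial grid.

```python
# A minimal sketch, assuming PyTorch, of pose-conditioned synthesis with
# controllable signer identity: the generator consumes keypoint heatmaps
# plus a learned per-signer style embedding, so appearance is selected at
# inference by the signer ID. Layer sizes and the embedding scheme are
# assumptions, not the SignGAN architecture.
import torch
import torch.nn as nn

class PoseConditionedGenerator(nn.Module):
    def __init__(self, n_signers=8, style_dim=64, pose_channels=50):
        super().__init__()
        self.style = nn.Embedding(n_signers, style_dim)  # one code per signer
        self.net = nn.Sequential(
            nn.Conv2d(pose_channels + style_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),   # RGB frame in [-1, 1]
        )

    def forward(self, pose_maps, signer_id):
        # pose_maps: (B, pose_channels, H, W), one heatmap per keypoint
        B, _, H, W = pose_maps.shape
        style = self.style(signer_id).view(B, -1, 1, 1).expand(-1, -1, H, W)
        return self.net(torch.cat([pose_maps, style], dim=1))
```

Because pose and appearance enter as separate inputs, the same pose sequence can be rendered in any trained signer's appearance simply by switching `signer_id`.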
Using a dataset of eight different sign language interpreters extracted from
broadcast footage, we show that SignGAN significantly outperforms all baseline
methods on quantitative metrics and in human perceptual studies.
Related papers
- EvSign: Sign Language Recognition and Translation with Streaming Events [59.51655336911345]
Event cameras can naturally perceive dynamic hand movements, providing rich manual cues for sign language tasks.
We propose an efficient transformer-based framework for event-based SLR and SLT tasks.
Our method performs favorably against existing state-of-the-art approaches with only 0.34% of the computational cost.
arXiv Detail & Related papers (2024-07-17T14:16:35Z)
- SignMusketeers: An Efficient Multi-Stream Approach for Sign Language Translation at Scale [22.49602248323602]
A persistent challenge in sign language video processing is how we learn representations of sign language.
Our proposed method focuses on just the most relevant parts in a signing video: the face, hands and body posture of the signer.
Our approach is based on learning from individual frames (rather than video sequences) and is therefore much more efficient than prior work on sign language pre-training.
arXiv Detail & Related papers (2024-06-11T03:00:41Z)
- Neural Sign Actors: A diffusion model for 3D sign language production from text [51.81647203840081]
Sign Languages (SL) serve as the primary mode of communication for the Deaf and Hard of Hearing communities.
This work takes an important step towards realistic neural sign avatars, bridging the communication gap between Deaf and hearing communities.
arXiv Detail & Related papers (2023-12-05T12:04:34Z)
- DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization [33.18321022815901]
We introduce DiffSLVA, a novel methodology for text-guided sign language video anonymization.
We develop a specialized module dedicated to capturing facial expressions, which are critical for conveying linguistic information in signed languages.
This methodology makes possible, for the first time, sign language video anonymization suitable for real-world applications.
arXiv Detail & Related papers (2023-11-27T18:26:19Z)
- SignDiff: Diffusion Models for American Sign Language Production [23.82668888574089]
We propose a dual-condition diffusion pre-training model named SignDiff that can generate human sign language speakers from a skeleton pose.
We also propose a new method for American Sign Language Production (ASLP), which can generate ASL skeletal pose videos from text input.
arXiv Detail & Related papers (2023-08-30T15:14:56Z)
- SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign Language Understanding [132.78015553111234]
Hand gestures play a crucial role in the expression of sign language.
Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resources.
We propose the first self-supervised pre-trainable SignBERT+ framework with model-aware hand prior incorporated.
arXiv Detail & Related papers (2023-05-08T17:16:38Z)
- Signing at Scale: Learning to Co-Articulate Signs for Large-Scale Photo-Realistic Sign Language Production [43.45785951443149]
Sign languages are visual languages, with vocabularies as rich as their spoken language counterparts.
Current deep-learning based Sign Language Production (SLP) models produce under-articulated skeleton pose sequences.
We tackle large-scale SLP by learning to co-articulate between dictionary signs.
We also propose SignGAN, a pose-conditioned human synthesis model that produces photo-realistic sign language videos.
arXiv Detail & Related papers (2022-03-29T08:51:38Z)
- Skeleton Based Sign Language Recognition Using Whole-body Keypoints [71.97020373520922]
Sign language is used by deaf or speech-impaired people to communicate.
Skeleton-based recognition is becoming popular because it can be ensembled with RGB-D based methods to achieve state-of-the-art performance.
Inspired by the recent development of whole-body pose estimation (Jin et al., 2020), we propose recognizing sign language based on whole-body keypoints and features.
arXiv Detail & Related papers (2021-03-16T03:38:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.