SignAvatar: Sign Language 3D Motion Reconstruction and Generation
- URL: http://arxiv.org/abs/2405.07974v1
- Date: Mon, 13 May 2024 17:48:22 GMT
- Title: SignAvatar: Sign Language 3D Motion Reconstruction and Generation
- Authors: Lu Dong, Lipisha Chaudhary, Fei Xu, Xiao Wang, Mason Lary, Ifeoma Nwogu
- Abstract summary: SignAvatar is a framework capable of both word-level sign language reconstruction and generation.
We contribute the ASL3DWord dataset, composed of 3D joint rotation data for the body, hands, and face.
- Score: 10.342253593687781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving expressive 3D motion reconstruction and automatic generation for isolated sign words can be challenging, due to the lack of real-world 3D sign-word data, the complex nuances of signing motions, and the cross-modal understanding of sign language semantics. To address these challenges, we introduce SignAvatar, a framework capable of both word-level sign language reconstruction and generation. SignAvatar employs a transformer-based conditional variational autoencoder architecture, effectively establishing relationships across different semantic modalities. Additionally, this approach incorporates a curriculum learning strategy to enhance the model's robustness and generalization, resulting in more realistic motions. Furthermore, we contribute the ASL3DWord dataset, composed of 3D joint rotation data for the body, hands, and face, for unique sign words. We demonstrate the effectiveness of SignAvatar through extensive experiments, showcasing its superior reconstruction and automatic generation capabilities. The code and dataset are available on the project page.
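The abstract names a conditional variational autoencoder (CVAE) but does not spell out its training objective. As a rough illustration only, the sketch below shows the two terms a generic CVAE loss usually combines: a reconstruction error and a KL penalty that pulls the latent distribution toward a standard normal. The function names, the mean-squared-error reconstruction term, and the `beta` weight are assumptions made for this example, not details taken from the paper; the conditioning signal would enter through the encoder and decoder networks, which are not shown here.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL divergence from N(mu, diag(exp(logvar))) to N(0, I), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def cvae_loss(target, recon, mu, logvar, beta=1.0):
    """Mean-squared reconstruction error plus a beta-weighted KL term (hypothetical weighting)."""
    mse = sum((t - r) ** 2 for t, r in zip(target, recon)) / len(target)
    return mse + beta * kl_to_standard_normal(mu, logvar)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, logvar = [0.5, -0.5], [0.0, 0.0]
    z = reparameterize(mu, logvar, rng)  # latent sample that a decoder would consume
    print(len(z), cvae_loss([1.0, 2.0], [1.0, 2.5], mu, logvar))
```

Sampling through `reparameterize` rather than drawing `z` directly is what keeps the sampling step differentiable with respect to `mu` and `logvar` when the same computation is written in an autodiff framework.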
Related papers
- MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production [93.32354378820648]
We propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users.
A sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step.
Experiments on How2Sign and PHOENIX14T datasets demonstrate that our model achieves competitive performance in sign language production.
arXiv Detail & Related papers (2024-07-04T13:53:50Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments robustly display our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
arXiv Detail & Related papers (2024-05-31T08:06:05Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- EmoVOCA: Speech-Driven Emotional 3D Talking Heads [12.161006152509653]
We propose an innovative data-driven technique for creating a synthetic dataset, called EmoVOCA.
We then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate the audio-synchronized lip movements with expressive traits of the face.
arXiv Detail & Related papers (2024-03-19T16:33:26Z)
- A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars [49.60328609426056]
Spoken2Sign is a system for translating spoken languages into sign languages.
We present a simple baseline consisting of three steps: creating a gloss-video dictionary, estimating a 3D sign for each sign video, and training a Spoken2Sign model.
As far as we know, we are the first to present the Spoken2Sign task in an output format of 3D signs.
arXiv Detail & Related papers (2024-01-09T18:59:49Z)
- SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark [20.11364909443987]
SignAvatars is the first large-scale, multi-prompt 3D sign language (SL) motion dataset designed to bridge the communication gap for Deaf and hard-of-hearing individuals.
The dataset comprises 70,000 videos from 153 signers, totaling 8.34 million frames, covering both isolated signs and continuous, co-articulated signs.
arXiv Detail & Related papers (2023-10-31T13:15:49Z)
- Learning Compositional Representation for 4D Captures with Neural ODE [72.56606274691033]
We introduce a compositional representation for 4D captures that disentangles shape, initial state, and motion.
To model the motion, a neural Ordinary Differential Equation (ODE) is trained to update the initial state conditioned on the learned motion code.
A decoder takes the shape code and the updated pose code to reconstruct 4D captures at each time stamp.
arXiv Detail & Related papers (2021-03-15T10:55:55Z)
- Independent Sign Language Recognition with 3D Body, Hands, and Face Reconstruction [46.70761714133466]
Independent Sign Language Recognition is a complex visual recognition problem that combines several challenging tasks of Computer Vision.
No work has adequately combined all three information channels to efficiently recognize Sign Language.
We employ SMPL-X, a contemporary parametric model that enables joint extraction of 3D body shape, face and hands information from a single image.
arXiv Detail & Related papers (2020-11-24T23:50:26Z)
- Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video [43.45785951443149]
To be truly understandable by Deaf communities, an automatic Sign Language Production system must generate a photo-realistic signer.
We propose SignGAN, the first SLP model to produce photo-realistic continuous sign language videos directly from spoken language.
A pose-conditioned human synthesis model is then introduced to generate a photo-realistic sign language video from the skeletal pose sequence.
arXiv Detail & Related papers (2020-11-19T14:31:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.