Reconstructing Signing Avatars From Video Using Linguistic Priors
- URL: http://arxiv.org/abs/2304.10482v1
- Date: Thu, 20 Apr 2023 17:29:50 GMT
- Title: Reconstructing Signing Avatars From Video Using Linguistic Priors
- Authors: Maria-Paola Forte and Peter Kulits and Chun-Hao Huang and Vasileios Choutas and Dimitrios Tzionas and Katherine J. Kuchenbecker and Michael J. Black
- Abstract summary: Sign language (SL) is the primary method of communication for the 70 million Deaf people around the world.
Replacing video dictionaries of isolated signs with 3D avatars can aid learning and enable AR/VR applications.
SGNify captures fine-grained hand pose, facial expression, and body movement fully automatically from in-the-wild monocular SL videos.
- Score: 54.5282429129769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign language (SL) is the primary method of communication for the 70 million
Deaf people around the world. Video dictionaries of isolated signs are a core
SL learning tool. Replacing these with 3D avatars can aid learning and enable
AR/VR applications, improving access to technology and online media. However,
little work has attempted to estimate expressive 3D avatars from SL video;
occlusion, noise, and motion blur make this task difficult. We address this by
introducing novel linguistic priors that are universally applicable to SL and
provide constraints on 3D hand pose that help resolve ambiguities within
isolated signs. Our method, SGNify, captures fine-grained hand pose, facial
expression, and body movement fully automatically from in-the-wild monocular SL
videos. We evaluate SGNify quantitatively by using a commercial motion-capture
system to compute 3D avatars synchronized with monocular video. SGNify
outperforms state-of-the-art 3D body-pose- and shape-estimation methods on SL
videos. A perceptual study shows that SGNify's 3D reconstructions are
significantly more comprehensible and natural than those of previous methods
and are on par with the source videos. Code and data are available at
http://sgnify.is.tue.mpg.de.
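The abstract does not spell out how the linguistic priors are formulated, so the following is only an illustrative sketch rather than SGNify's actual objective: it shows how a hypothetical hand-symmetry prior could enter a per-frame pose-fitting objective as a soft penalty, assuming MANO-style axis-angle hand-pose parameters. The function names, weights, and mirroring rule below are assumptions made for this sketch, not details taken from the paper.

```python
import torch

def sign_fitting_loss(left_hand_pose, right_hand_pose, data_term,
                      w_data=1.0, w_sym=0.1):
    """Toy per-frame objective: data term plus a hand-symmetry prior.

    left_hand_pose / right_hand_pose: (15, 3) axis-angle finger rotations
    (MANO-style; an assumption for this sketch).
    data_term: scalar tensor, e.g. a 2D keypoint reprojection error that
    would normally come from a full body-fitting pipeline.
    """
    # Mirror the right hand into the left hand's frame by flipping two
    # axis-angle components (a simplification used only for illustration).
    mirrored_right = right_hand_pose * torch.tensor([1.0, -1.0, -1.0])

    # Soft constraint: a symmetric two-handed sign should have (mirrored)
    # matching articulation, which disambiguates an occluded or blurry hand.
    symmetry_prior = ((left_hand_pose - mirrored_right) ** 2).sum()

    return w_data * data_term + w_sym * symmetry_prior


# Usage: optimize both hand poses for one frame with the prior as a penalty.
left = torch.zeros(15, 3, requires_grad=True)
right = torch.zeros(15, 3, requires_grad=True)
optimizer = torch.optim.Adam([left, right], lr=0.01)
for _ in range(100):
    optimizer.zero_grad()
    # Placeholder data term standing in for a real reprojection error.
    placeholder_data = ((left - 0.2) ** 2).sum() + ((right + 0.2) ** 2).sum()
    loss = sign_fitting_loss(left, right, placeholder_data)
    loss.backward()
    optimizer.step()
```

In a setup like this, the prior contributes information exactly when the data term is weak, e.g. when one hand is occluded or motion-blurred, which is the kind of ambiguity the abstract says the linguistic priors help resolve.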
Related papers
- DEGAS: Detailed Expressions on Full-Body Gaussian Avatars [13.683836322899953]
We present DEGAS, the first 3D Gaussian Splatting (3DGS)-based modeling method for full-body avatars with rich facial expressions.
We propose to adopt the expression latent space trained solely on 2D portrait images, bridging the gap between 2D talking faces and 3D avatars.
arXiv Detail & Related papers (2024-08-20T06:52:03Z)
- Expressive Whole-Body 3D Gaussian Avatar [34.3179424934446]
We present ExAvatar, an expressive whole-body 3D human avatar learned from a short monocular video.
The main challenges are 1) a limited diversity of facial expressions and poses in the video and 2) the absence of 3D observations, such as 3D scans and RGBD images.
arXiv Detail & Related papers (2024-07-31T15:29:13Z)
- Neural Sign Actors: A diffusion model for 3D sign language production from text [51.81647203840081]
Sign Languages (SL) serve as the primary mode of communication for the Deaf and Hard of Hearing communities.
This work makes an important step towards realistic neural sign avatars, bridging the communication gap between Deaf and hearing communities.
arXiv Detail & Related papers (2023-12-05T12:04:34Z)
- SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark [20.11364909443987]
SignAvatars is the first large-scale, multi-prompt 3D sign language (SL) motion dataset designed to bridge the communication gap for Deaf and hard-of-hearing individuals.
The dataset comprises 70,000 videos from 153 signers, totaling 8.34 million frames, covering both isolated signs and continuous, co-articulated signs.
arXiv Detail & Related papers (2023-10-31T13:15:49Z)
- PLA: Language-Driven Open-Vocabulary 3D Scene Understanding [57.47315482494805]
Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space.
The recent breakthrough in 2D open-vocabulary perception is driven by Internet-scale paired image-text data with rich vocabulary concepts.
We propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D.
arXiv Detail & Related papers (2022-11-29T15:52:22Z)
- Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
- Body2Hands: Learning to Infer 3D Hands from Conversational Gesture Body Dynamics [87.17505994436308]
We build upon the insight that body motion and hand gestures are strongly correlated in non-verbal communication settings.
We formulate the learning of this prior as a prediction task of 3D hand shape over time given body motion input alone.
Our hand prediction model produces convincing 3D hand gestures given only the 3D motion of the speaker's arms as input.
arXiv Detail & Related papers (2020-07-23T22:58:15Z)
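The Body2Hands entry above frames hand estimation as a sequence-prediction problem: given only body (arm) motion over time, predict the corresponding 3D hand pose. As a rough illustration of that input/output mapping, and not the paper's actual architecture, a minimal recurrent regressor could look like the sketch below; the dimensions, layer choices, and class name are assumptions.

```python
import torch
import torch.nn as nn

class ArmToHandRegressor(nn.Module):
    """Toy sequence model: map arm-joint motion to per-frame hand pose."""

    def __init__(self, arm_dim=12, hand_dim=45, hidden=128):
        super().__init__()
        # A GRU accumulates temporal context from the arm trajectory.
        self.encoder = nn.GRU(arm_dim, hidden, batch_first=True)
        # A linear head regresses MANO-style axis-angle hand pose per frame.
        self.head = nn.Linear(hidden, hand_dim)

    def forward(self, arm_motion):          # (B, T, arm_dim)
        features, _ = self.encoder(arm_motion)
        return self.head(features)          # (B, T, hand_dim)


# Usage: regress hand pose for a 60-frame clip from 4 arm joints (x, y, z each).
model = ArmToHandRegressor()
arm_motion = torch.randn(1, 60, 12)
hand_pose = model(arm_motion)               # (1, 60, 45)
```

Such a model would typically be trained with a regression loss (e.g. mean squared error) against ground-truth hand poses from motion capture; the point of the sketch is only the body-motion-in, hand-pose-out formulation described in the entry.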
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.