SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark
- URL: http://arxiv.org/abs/2310.20436v3
- Date: Tue, 2 Jul 2024 15:10:28 GMT
- Title: SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark
- Authors: Zhengdi Yu, Shaoli Huang, Yongkang Cheng, Tolga Birdal
- Abstract summary: SignAvatars is the first large-scale, multi-prompt 3D sign language (SL) motion dataset designed to bridge the communication gap for Deaf and hard-of-hearing individuals.
The dataset comprises 70,000 videos from 153 signers, totaling 8.34 million frames, covering both isolated signs and continuous, co-articulated signs.
- Score: 20.11364909443987
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present SignAvatars, the first large-scale, multi-prompt 3D sign language (SL) motion dataset designed to bridge the communication gap for Deaf and hard-of-hearing individuals. While research on digital communication has grown rapidly, most existing communication technologies cater primarily to spoken or written languages rather than SL, the essential communication method for Deaf and hard-of-hearing communities. Existing SL datasets, dictionaries, and sign language production (SLP) methods are typically limited to 2D, since annotating 3D models and avatars for SL is usually an entirely manual, labor-intensive process conducted by SL experts that often results in unnatural avatars. In response to these challenges, we compile and curate the SignAvatars dataset, which comprises 70,000 videos from 153 signers, totaling 8.34 million frames, covering both isolated signs and continuous, co-articulated signs, with multiple prompts including HamNoSys, spoken language, and words. To yield 3D holistic annotations, including meshes and biomechanically valid poses of the body, hands, and face, as well as 2D and 3D keypoints, we introduce an automated annotation pipeline operating on our large corpus of SL videos. SignAvatars facilitates various tasks such as 3D sign language recognition (SLR) and the novel task of 3D SL production (SLP) from diverse inputs like text scripts, individual words, and HamNoSys notation. To evaluate the potential of SignAvatars, we further propose a unified benchmark for 3D SL holistic motion production. We believe this work is a significant step towards bringing the digital world to the Deaf and hard-of-hearing communities as well as to people interacting with them.
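The abstract describes holistic per-frame annotations (meshes plus body, hand, and face poses, and 2D/3D keypoints). The following is a minimal sketch of how such SMPL-X-style holistic parameters map to meshes and 3D joints; the per-clip .npz layout and field names ("betas", "body_pose", etc.) are hypothetical assumptions for illustration, not SignAvatars' actual schema.

```python
# Minimal sketch: recovering per-frame holistic meshes from hypothetical
# SMPL-X-style annotations. The .npz layout and field names below are
# assumptions for illustration, not the dataset's actual schema.
import numpy as np
import torch
import smplx  # https://github.com/vchoutas/smplx


def load_clip_meshes(annotation_path: str, smplx_model_dir: str):
    ann = np.load(annotation_path)              # hypothetical per-clip .npz file
    n_frames = ann["body_pose"].shape[0]

    model = smplx.create(
        smplx_model_dir, model_type="smplx",
        gender="neutral", use_pca=False,        # full 45-dim axis-angle hand poses
        num_betas=10, num_expression_coeffs=10,
        batch_size=n_frames,
    )
    to_t = lambda k: torch.as_tensor(ann[k], dtype=torch.float32)
    out = model(
        betas=to_t("betas"),                    # (F, 10) shape coefficients
        global_orient=to_t("global_orient"),    # (F, 3)  root rotation
        body_pose=to_t("body_pose"),            # (F, 63) body joint rotations
        left_hand_pose=to_t("left_hand_pose"),  # (F, 45)
        right_hand_pose=to_t("right_hand_pose"),# (F, 45)
        jaw_pose=to_t("jaw_pose"),              # (F, 3)
        expression=to_t("expression"),          # (F, 10) facial expression
        transl=to_t("transl"),                  # (F, 3)  global translation
    )
    # Meshes (F, 10475, 3) and 3D joints usable as holistic keypoints.
    return out.vertices.detach(), out.joints.detach()
```

In practice, the dataset's own loading utilities and parameter conventions should be consulted; this sketch only illustrates the general mapping from holistic pose parameters to meshes and 3D keypoints.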
Related papers
- 3D-LEX v1.0: 3D Lexicons for American Sign Language and Sign Language of the Netherlands [1.8641315013048299]
We present an efficient approach for capturing sign language in 3D, introduce the 3D-LEX dataset, and detail a method for semi-automatic annotation of phonetic properties.
Our procedure integrates three motion capture techniques encompassing high-resolution 3D poses, 3D handshapes, and depth-aware facial features.
The 3D-LEX dataset includes 1,000 signs from American Sign Language and an additional 1,000 signs from the Sign Language of the Netherlands.
arXiv Detail & Related papers (2024-09-03T13:44:56Z) - MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in
3D World [55.878173953175356]
We propose MultiPLY, a multisensory embodied large language model.
We first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data samples.
We demonstrate that MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks.
arXiv Detail & Related papers (2024-01-16T18:59:45Z) - A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars [49.60328609426056]
Spoken2Sign is a system for translating spoken languages into sign languages.
We present a simple baseline consisting of three steps: creating a gloss-video dictionary, estimating a 3D sign for each sign video, and training a Spoken2Sign model.
As far as we know, we are the first to present the Spoken2Sign task in an output format of 3D signs.
arXiv Detail & Related papers (2024-01-09T18:59:49Z) - Neural Sign Actors: A diffusion model for 3D sign language production from text [51.81647203840081]
Sign Languages (SL) serve as the primary mode of communication for the Deaf and Hard of Hearing communities.
This work makes an important step towards realistic neural sign avatars, bridging the communication gap between Deaf and hearing communities.
arXiv Detail & Related papers (2023-12-05T12:04:34Z) - Reconstructing Signing Avatars From Video Using Linguistic Priors [54.5282429129769]
Sign language (SL) is the primary method of communication for the 70 million Deaf people around the world.
Replacing video dictionaries of isolated signs with 3D avatars can aid learning and enable AR/VR applications.
SGNify captures fine-grained hand pose, facial expression, and body movement fully automatically from in-the-wild monocular SL videos.
arXiv Detail & Related papers (2023-04-20T17:29:50Z) - Skeleton Based Sign Language Recognition Using Whole-body Keypoints [71.97020373520922]
Sign language is used by deaf or speech-impaired people to communicate.
Skeleton-based recognition is becoming popular because it can be further ensembled with RGB-D based methods to achieve state-of-the-art performance.
Inspired by the recent development of whole-body pose estimation [Jin et al., 2020], we propose recognizing sign language based on whole-body keypoints and features.
arXiv Detail & Related papers (2021-03-16T03:38:17Z) - Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z) - Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign
Language Video [43.45785951443149]
To be truly understandable by Deaf communities, an automatic Sign Language Production system must generate a photo-realistic signer.
We propose SignGAN, the first SLP model to produce photo-realistic continuous sign language videos directly from spoken language.
A pose-conditioned human synthesis model is then introduced to generate a photo-realistic sign language video from the skeletal pose sequence.
arXiv Detail & Related papers (2020-11-19T14:31:06Z) - How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign
Language [37.578776156503906]
How2Sign is a multimodal and multiview continuous American Sign Language (ASL) dataset.
It consists of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities including speech, English transcripts, and depth.
A three-hour subset was recorded in the Panoptic studio enabling detailed 3D pose estimation.
arXiv Detail & Related papers (2020-08-18T20:22:16Z)