DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism
- URL: http://arxiv.org/abs/2412.03878v1
- Date: Thu, 05 Dec 2024 05:18:28 GMT
- Title: DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism
- Authors: Sudha Krishnamurthy, Vimal Bhat, Abhinav Jain
- Abstract summary: We create sign language videos with synthetic signers that are realistic and expressive.
Our approach combines parametric modeling and generative modeling to generate realistic-looking synthetic signers.
Our results show that the sign language videos generated using our approach have better temporal consistency and realism than signing videos generated by a diffusion model conditioned only on text prompts.
- Score: 1.6536018920603175
- Abstract: The proliferation of streaming services in recent years has made it possible for a diverse audience across the world to view the same media content, such as movies or TV shows. While translation and dubbing services are being added to make content accessible to the local audience, the support for making content accessible to people with different abilities, such as the Deaf and Hard of Hearing (DHH) community, is still lagging. Our goal is to make media content more accessible to the DHH community by generating sign language videos with synthetic signers that are realistic and expressive. Using the same signer for a given media content that is viewed globally may have limited appeal. Hence, our approach combines parametric modeling and generative modeling to generate realistic-looking synthetic signers and customize their appearance based on user preferences. We first retarget human sign language poses to 3D sign language avatars by optimizing a parametric model. The high-fidelity poses from the rendered avatars are then used to condition the poses of synthetic signers generated using a diffusion-based generative model. The appearance of the synthetic signer is controlled by an image prompt supplied through a visual adapter. Our results show that the sign language videos generated using our approach have better temporal consistency and realism than signing videos generated by a diffusion model conditioned only on text prompts. We also support multimodal prompts to allow users to further customize the appearance of the signer to accommodate diversity (e.g. skin tone, gender). Our approach is also useful for signer anonymization.
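The second stage described in the abstract (pose-conditioned diffusion with an image prompt controlling the signer's appearance) can be approximated with off-the-shelf components. The sketch below is not the authors' implementation; it is a minimal illustration using the open-source diffusers library, assuming an OpenPose-style ControlNet as the pose-conditioning branch and IP-Adapter as the visual adapter. The model IDs, file names, and the frame-by-frame loop with a re-used seed are illustrative assumptions, and the paper's parametric 3D avatar retargeting and temporal-consistency mechanism are not reproduced here.

```python
# Hedged sketch of a DiffSign-style generation stage: skeletal pose frames (assumed to be
# rendered from the retargeted 3D avatar) condition a Stable Diffusion + ControlNet
# pipeline, while a reference image supplied through IP-Adapter controls the signer's
# appearance. Model IDs and file paths are assumptions, not the authors' choices.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Pose-conditioning branch: a ControlNet trained on OpenPose skeleton images.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Visual adapter: an image prompt that customizes the signer's appearance
# (e.g. skin tone, gender) instead of describing it only in text.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
signer_reference = load_image("signer_appearance.png")  # hypothetical reference image

frames = []
for pose_path in ["pose_000.png", "pose_001.png", "pose_002.png"]:  # avatar-rendered poses
    pose_image = load_image(pose_path)
    # Re-seed per frame so every frame starts from the same initial noise: a crude
    # stand-in for the temporal-consistency handling reported in the paper.
    generator = torch.Generator("cuda").manual_seed(0)
    frame = pipe(
        prompt="a person signing in front of a plain background",
        image=pose_image,                   # skeletal pose condition
        ip_adapter_image=signer_reference,  # appearance condition
        num_inference_steps=30,
        generator=generator,
    ).images[0]
    frames.append(frame)
# `frames` can then be written out as a video (e.g. with imageio) for review.
```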
Related papers
- DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models [72.24305287508474]
We introduce DiCoDe, a novel approach to generate videos with a language model in an autoregressive manner.
By treating videos as temporal sequences, DiCoDe fully harnesses the capabilities of language models for autoregressive generation.
We evaluate DiCoDe both quantitatively and qualitatively, demonstrating that it performs comparably to existing methods in terms of quality.
arXiv Detail & Related papers (2024-12-05T18:57:06Z)
- Signs as Tokens: An Autoregressive Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs.
To align sign language with the LM, we develop a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts.
These sign tokens are integrated into the raw text vocabulary of the LM, allowing for supervised fine-tuning on sign language datasets.
arXiv Detail & Related papers (2024-11-26T18:28:09Z)
- Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions.
We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms.
Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z)
- DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization [33.18321022815901]
We introduce DiffSLVA, a novel methodology for text-guided sign language video anonymization.
We develop a specialized module dedicated to capturing facial expressions, which are critical for conveying linguistic information in signed languages.
This innovative methodology makes possible, for the first time, sign language video anonymization that could be used for real-world applications.
arXiv Detail & Related papers (2023-11-27T18:26:19Z)
- ChatAnything: Facetime Chat with LLM-Enhanced Personas [87.76804680223003]
We propose the mixture of voices (MoV) and the mixture of diffusers (MoD) for diverse voice and appearance generation.
For MoV, we utilize the text-to-speech (TTS) algorithms with a variety of pre-defined tones.
For MoD, we combine recent text-to-image generation techniques with talking head algorithms to streamline the process of generating talking objects.
arXiv Detail & Related papers (2023-11-12T08:29:41Z)
- DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder [55.58582254514431]
We propose DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech.
We also introduce pose modelling in speech2latent for pose controllability.
Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness.
arXiv Detail & Related papers (2023-03-30T17:18:31Z)
- Signing at Scale: Learning to Co-Articulate Signs for Large-Scale Photo-Realistic Sign Language Production [43.45785951443149]
Sign languages are visual languages, with vocabularies as rich as their spoken language counterparts.
Current deep-learning based Sign Language Production (SLP) models produce under-articulated skeleton pose sequences.
We tackle large-scale SLP by learning to co-articulate between dictionary signs.
We also propose SignGAN, a pose-conditioned human synthesis model that produces photo-realistic sign language videos.
arXiv Detail & Related papers (2022-03-29T08:51:38Z)
- AnonySIGN: Novel Human Appearance Synthesis for Sign Language Video Anonymisation [37.679114155300084]
We introduce the task of Sign Language Video Anonymisation (SLVA) as an automatic method to anonymise the visual appearance of a sign language video.
To tackle SLVA, we propose AnonySign, a novel automatic approach for visual anonymisation of sign language data.
arXiv Detail & Related papers (2021-07-22T13:42:18Z)
- Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video [43.45785951443149]
To be truly understandable by Deaf communities, an automatic Sign Language Production system must generate a photo-realistic signer.
We propose SignGAN, the first SLP model to produce photo-realistic continuous sign language videos directly from spoken language.
A pose-conditioned human synthesis model is then introduced to generate a photo-realistic sign language video from the skeletal pose sequence.
arXiv Detail & Related papers (2020-11-19T14:31:06Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.