DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism
- URL: http://arxiv.org/abs/2412.03878v1
- Date: Thu, 05 Dec 2024 05:18:28 GMT
- Title: DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism
- Authors: Sudha Krishnamurthy, Vimal Bhat, Abhinav Jain
- Abstract summary: We create sign language videos with synthetic signers that are realistic and expressive. Our approach combines parametric modeling and generative modeling to generate realistic-looking synthetic signers. Our results show that the sign language videos generated using our approach have better temporal consistency and realism than signing videos generated by a diffusion model conditioned only on text prompts.
- Score: 1.6536018920603175
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The proliferation of streaming services in recent years has made it possible for a diverse audience across the world to view the same media content, such as movies or TV shows. While translation and dubbing services are being added to make content accessible to local audiences, support for making content accessible to people with different abilities, such as the Deaf and Hard of Hearing (DHH) community, still lags behind. Our goal is to make media content more accessible to the DHH community by generating sign language videos with synthetic signers that are realistic and expressive. Using the same signer for media content that is viewed globally may have limited appeal. Hence, our approach combines parametric modeling and generative modeling to generate realistic-looking synthetic signers and customize their appearance based on user preferences. We first retarget human sign language poses to 3D sign language avatars by optimizing a parametric model. The high-fidelity poses from the rendered avatars are then used to condition the poses of synthetic signers generated using a diffusion-based generative model. The appearance of the synthetic signer is controlled by an image prompt supplied through a visual adapter. Our results show that the sign language videos generated using our approach have better temporal consistency and realism than signing videos generated by a diffusion model conditioned only on text prompts. We also support multimodal prompts to allow users to further customize the appearance of the signer to accommodate diversity (e.g., skin tone, gender). Our approach is also useful for signer anonymization.
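To make the generation stage concrete, the following is a minimal per-frame sketch of pose-conditioned diffusion with an image prompt controlling the signer's appearance, assembled from generic off-the-shelf components in the Hugging Face diffusers library. The OpenPose ControlNet, the IP-Adapter checkpoint standing in for the paper's visual adapter, the file names, and the parameter values are all illustrative assumptions rather than the authors' implementation, and the sketch omits the temporal-consistency machinery needed for coherent video.

```python
# Illustrative sketch only: per-frame pose-conditioned generation with an
# image prompt for appearance control. Checkpoints, file names, and scales
# below are assumptions, not the DiffSign paper's actual configuration.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Pose condition: an OpenPose-style skeleton image rendered from the
# retargeted 3D avatar pose (hypothetical file name).
pose_image = load_image("avatar_pose_frame_0001.png")
# Image prompt: a reference image describing the desired signer appearance.
signer_reference = load_image("signer_reference.png")

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# IP-Adapter used here as a stand-in for the paper's visual adapter that
# injects the image prompt.
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin"
)
pipe.set_ip_adapter_scale(0.6)  # how strongly the image prompt steers appearance

frame = pipe(
    prompt="a person signing in front of a plain studio background",
    image=pose_image,                   # pose condition from the rendered avatar
    ip_adapter_image=signer_reference,  # controls the synthetic signer's look
    num_inference_steps=30,
).images[0]
frame.save("generated_signer_frame_0001.png")
```

Generating a full signing video would repeat this conditioning over the pose sequence while enforcing consistency across frames, which is where the paper reports its gains over text-only conditioning.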
Related papers
- SignX: The Foundation Model for Sign Recognition [28.651340554377906]
This paper proposes SignX, a foundation model framework for sign recognition.
It is a concise yet powerful framework applicable to multiple human activity recognition scenarios.
Experimental results show that SignX can recognize signs from sign language video, producing predicted gloss representations with greater accuracy than has been reported in prior work.
arXiv Detail & Related papers (2025-04-22T23:23:39Z) - Shushing! Let's Imagine an Authentic Speech from the Silent Video [15.426152742881365]
Vision-guided speech generation aims to produce authentic speech from facial appearance or lip motions without relying on auditory signals.
Despite recent progress, existing methods struggle to achieve unified cross-modal alignment across semantics, timbre, and emotional prosody from visual cues.
We introduce ImaginTalk, a novel cross-modal diffusion framework that generates faithful speech using only visual input.
arXiv Detail & Related papers (2025-03-19T06:28:17Z) - DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models [72.24305287508474]
We introduce DiCoDe, a novel approach to generate videos with a language model in an autoregressive manner.
By treating videos as temporal sequences, DiCoDe fully harnesses the capabilities of language models for autoregressive generation.
We evaluate DiCoDe both quantitatively and qualitatively, demonstrating that it performs comparably to existing methods in terms of quality.
arXiv Detail & Related papers (2024-12-05T18:57:06Z) - Signs as Tokens: An Autoregressive Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs.
To align sign language with the LM, we develop a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts.
These sign tokens are integrated into the raw text vocabulary of the LM, allowing for supervised fine-tuning on sign language datasets.
arXiv Detail & Related papers (2024-11-26T18:28:09Z) - Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions.
We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms.
Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z) - DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization [33.18321022815901]
We introduce DiffSLVA, a novel methodology for text-guided sign language video anonymization.
We develop a specialized module dedicated to capturing facial expressions, which are critical for conveying linguistic information in signed languages.
This innovative methodology makes possible, for the first time, sign language video anonymization that could be used for real-world applications.
arXiv Detail & Related papers (2023-11-27T18:26:19Z) - ChatAnything: Facetime Chat with LLM-Enhanced Personas [87.76804680223003]
We propose the mixture of voices (MoV) and the mixture of diffusers (MoD) for diverse voice and appearance generation.
For MoV, we utilize text-to-speech (TTS) algorithms with a variety of pre-defined tones.
For MoD, we combine recent popular text-to-image generation techniques with talking-head algorithms to streamline the process of generating talking objects.
arXiv Detail & Related papers (2023-11-12T08:29:41Z) - Text-image guided Diffusion Model for generating Deepfake celebrity interactions [50.37578424163951]
Diffusion models have recently demonstrated highly realistic visual content generation.
This paper devises and explores a novel method in that regard.
Our results show that with the devised scheme, it is possible to create fake visual content with alarming realism.
arXiv Detail & Related papers (2023-09-26T08:24:37Z) - Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos [16.717930760451996]
We address the problem of verifying whether the intended person is driving a synthetic talking-head video, a task we term avatar fingerprinting.
We first introduce a large-scale dataset of real and synthetic videos of people interacting on a video call.
We verify the identity driving the expressions in a synthetic video by learning motion signatures that are independent of the facial appearance shown.
arXiv Detail & Related papers (2023-05-05T17:54:34Z) - DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder [55.58582254514431]
We propose DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech.
We also introduce pose modelling in speech2latent for pose controllability.
Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness.
arXiv Detail & Related papers (2023-03-30T17:18:31Z) - Neural Sign Reenactor: Deep Photorealistic Sign Language Retargeting [28.012212656892746]
We introduce a neural rendering pipeline for transferring the facial expressions, head pose, and body movements of one person in a source video to another in a target video.
Our method can be used for Sign Language Anonymization, Sign Language Production (synthesis module), as well as for reenacting other types of full body activities.
arXiv Detail & Related papers (2022-09-03T18:04:50Z) - Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video [43.45785951443149]
To be truly understandable by Deaf communities, an automatic Sign Language Production system must generate a photo-realistic signer.
We propose SignGAN, the first SLP model to produce photo-realistic continuous sign language videos directly from spoken language.
A pose-conditioned human synthesis model is then introduced to generate a photo-realistic sign language video from the skeletal pose sequence.
arXiv Detail & Related papers (2020-11-19T14:31:06Z) - Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)