A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony
in Talking Head Generation
- URL: http://arxiv.org/abs/2307.03270v1
- Date: Tue, 4 Jul 2023 08:29:59 GMT
- Title: A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony
in Talking Head Generation
- Authors: Louis Airale (UGA, LIG), Dominique Vaufreydaz (LIG), Xavier
Alameda-Pineda (UGA)
- Abstract summary: We propose a multi-scale audio-visual synchrony loss and a multi-scale autoregressive GAN to better handle short- and long-term correlations between speech and head motion.
Our generator operates in the facial landmark domain, which is a standard low-dimensional head representation.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Animating still face images with deep generative models using a speech input
signal is an active research topic and has seen important recent progress.
However, much of the effort has been put into lip syncing and rendering quality
while the generation of natural head motion, let alone the audio-visual
correlation between head motion and speech, has often been neglected. In this
work, we propose a multi-scale audio-visual synchrony loss and a multi-scale
autoregressive GAN to better handle short- and long-term correlations between
speech and the dynamics of the head and lips. In particular, we train a stack
of syncer models on multimodal input pyramids and use these models as guidance
in a multi-scale generator network to produce audio-aligned motion unfolding
over diverse time scales. Our generator operates in the facial landmark domain,
which is a standard low-dimensional head representation. The experiments show
significant improvements over the state of the art in head motion dynamics
quality and in multi-scale audio-visual synchrony both in the landmark domain
and in the image domain.
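To make the core idea concrete, below is a minimal PyTorch sketch of a multi-scale synchrony loss driven by a pyramid of syncer models, in the spirit of the abstract. The encoder architectures, the average-pooling pyramid, the cosine score, and all dimensions (80-bin mel audio features, 68 two-dimensional facial landmarks) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a multi-scale audio-visual synchrony loss, assuming a
# pyramid of syncer models as described in the abstract. Architectures,
# dimensions, and the pooling scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Syncer(nn.Module):
    """Scores audio/motion alignment at a single temporal scale."""
    def __init__(self, audio_dim=80, motion_dim=136, hidden=128):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.motion_enc = nn.GRU(motion_dim, hidden, batch_first=True)

    def forward(self, audio, motion):
        # audio: (B, T, audio_dim); motion: (B, T, motion_dim)
        _, a = self.audio_enc(audio)    # final hidden state: (1, B, hidden)
        _, m = self.motion_enc(motion)
        # Cosine similarity between the two embeddings as the sync score.
        return F.cosine_similarity(a.squeeze(0), m.squeeze(0), dim=-1)

def multiscale_sync_loss(syncers, audio, motion):
    """Average (1 - score) over a temporal pyramid; coarser levels
    penalize misalignment of longer-term dynamics."""
    loss = 0.0
    for level, syncer in enumerate(syncers):
        stride = 2 ** level
        # Build this pyramid level by average-pooling along time.
        a = F.avg_pool1d(audio.transpose(1, 2), stride, stride).transpose(1, 2)
        m = F.avg_pool1d(motion.transpose(1, 2), stride, stride).transpose(1, 2)
        loss = loss + (1.0 - syncer(a, m)).mean()
    return loss / len(syncers)

# Usage with random tensors standing in for real features:
syncers = [Syncer() for _ in range(3)]   # pretrained and frozen in the paper
audio = torch.randn(4, 64, 80)           # 64 frames of mel features
motion = torch.randn(4, 64, 136)         # 68 x 2 landmark coords per frame
print(multiscale_sync_loss(syncers, audio, motion))
```

Per the abstract, the syncer stack is trained on multimodal input pyramids and then used as guidance for the multi-scale generator; the exact score function and loss form shown here are assumptions.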
Related papers
- Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation [22.159117464397806]
We introduce a two-stage diffusion-based model for talking head generation.
The first stage involves generating synchronized facial landmarks based on the given speech.
In the second stage, these generated landmarks serve as a condition in the denoising process, aiming to mitigate mouth jitter and generate high-fidelity, well-synchronized, and temporally coherent talking head videos.
arXiv Detail & Related papers (2024-08-03T10:19:38Z)
- EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone.
We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state of the art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z)
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z)
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from an audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
- VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior [28.737324182301652]
We propose a two-stage generic framework that supports generating talking head videos of high visual quality.
In the first stage, we map the audio to a mesh by learning two types of motion: non-rigid expression motion and rigid head motion.
In the second stage, we propose a dual-branch motion VAE and a generator to transform the meshes into dense motion and synthesize high-quality video frame by frame.
arXiv Detail & Related papers (2023-12-04T12:25:37Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering [69.9557427451339]
We propose a framework based on a neural radiance field to pursue high-fidelity talking head generation.
Specifically, the neural radiance field takes lip movement features and personalized attributes as two disentangled conditions.
We show that our method achieves significantly better results than state-of-the-art methods.
arXiv Detail & Related papers (2022-01-03T18:23:38Z)
- Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion [34.406907667904996]
We propose an audio-driven talking-head method to generate photo-realistic talking-head videos from a single reference image.
We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN); see the sketch after this list.
Then, we develop a motion field generator to produce the dense motion fields from input audio, head poses, and a reference image.
arXiv Detail & Related papers (2021-07-20T07:22:42Z)
- Multi Modal Adaptive Normalization for Audio to Video Generation [18.812696623555855]
We propose a multi-modal adaptive normalization (MAN) based architecture to synthesize a talking person video of arbitrary length from an audio signal and a single image of a person.
The architecture uses multi-modal adaptive normalization, a keypoint heatmap predictor, an optical flow predictor, and class activation map [58] based layers to learn the movements of expressive facial components.
arXiv Detail & Related papers (2020-12-14T07:39:45Z)
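Following up on the Audio2Head entry above, here is a hypothetical sketch of a motion-aware RNN head pose predictor that maps audio features to rigid 6D head poses (3 rotations plus 3 translations). The architecture, the pose-feedback scheme, and all dimensions are assumptions for illustration, not the paper's code.

```python
# Hypothetical sketch of an Audio2Head-style head pose predictor. Feeding
# the previous pose back into the RNN is one plausible reading of
# "motion-aware"; details and dimensions are assumptions.
import torch
import torch.nn as nn

class HeadPosePredictor(nn.Module):
    def __init__(self, audio_dim=80, hidden=256):
        super().__init__()
        # Input per step: current audio frame + previous 6-D pose.
        self.rnn = nn.LSTM(audio_dim + 6, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 6)  # per-frame 6-D pose output

    def forward(self, audio, init_pose):
        # audio: (B, T, audio_dim); init_pose: (B, 6) reference-frame pose.
        poses, h, prev = [], None, init_pose
        for t in range(audio.size(1)):
            x = torch.cat([audio[:, t], prev], dim=-1).unsqueeze(1)
            out, h = self.rnn(x, h)
            prev = self.head(out.squeeze(1))
            poses.append(prev)
        return torch.stack(poses, dim=1)  # (B, T, 6) pose sequence

# Usage with placeholder inputs:
predictor = HeadPosePredictor()
poses = predictor(torch.randn(2, 40, 80), torch.zeros(2, 6))  # (2, 40, 6)
```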
This list is automatically generated from the titles and abstracts of the papers in this site.