Perceptual Conversational Head Generation with Regularized Driver and
Enhanced Renderer
- URL: http://arxiv.org/abs/2206.12837v1
- Date: Sun, 26 Jun 2022 10:12:59 GMT
- Title: Perceptual Conversational Head Generation with Regularized Driver and
Enhanced Renderer
- Authors: Ailin Huang, Zhewei Huang, Shuchang Zhou
- Abstract summary: Our solution focuses on training a generalized audio-to-head driver using regularization and assembling a high visual quality renderer.
We get first place in the listening head generation track and second place in the talking head generation track in the official ranking.
- Score: 4.201920674650052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper reports our solution for MultiMedia ViCo 2022 Conversational Head
Generation Challenge, which aims to generate vivid face-to-face conversation
videos based on audio and reference images. Our solution focuses on training a
generalized audio-to-head driver using regularization and assembling a high
visual quality renderer. We carefully tweak the audio-to-behavior model and
post-process the generated video using our foreground-background fusion module.
We get first place in the listening head generation track and second place in
the talking head generation track in the official ranking. Our code will be
released.
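The abstract leaves the foreground-background fusion module unspecified, but the post-processing it describes amounts to compositing the generated head (foreground) back onto the original background. Below is a minimal NumPy sketch of such a step, assuming a precomputed soft matte; the matte source and the blending are illustrative assumptions, not the authors' published method.

```python
import numpy as np

def fuse_foreground_background(fg, bg, matte):
    """Alpha-composite a generated head frame (fg) onto a background frame (bg).

    fg, bg: float32 arrays of shape (H, W, 3) with values in [0, 1]
    matte:  float32 array of shape (H, W); 1.0 = head region, 0.0 = background
    """
    alpha = matte[..., None]                 # broadcast to (H, W, 1)
    return alpha * fg + (1.0 - alpha) * bg

# Example: blend a generated frame onto its reference background.
h, w = 256, 256
generated = np.random.rand(h, w, 3).astype(np.float32)
background = np.random.rand(h, w, 3).astype(np.float32)
matte = np.zeros((h, w), dtype=np.float32)
matte[64:192, 64:192] = 1.0                  # hypothetical head region
fused = fuse_foreground_background(generated, background, matte)
```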
Related papers
- LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details [14.22392871407274]
We present an effective post-processing approach to synthesize photo-realistic talking head videos.
Specifically, we employ a pretrained Wav2Lip model as our foundation model, leveraging its robust audio-lip alignment capabilities.
Results indicate that our method, LaDTalk, achieves new state-of-the-art video quality and out-of-domain lip synchronization performance.
arXiv Detail & Related papers (2024-10-01T18:32:02Z) - One-Shot Pose-Driving Face Animation Platform [7.422568903818486]
We refine an existing Image2Video model by integrating a Face Locator and Motion Frame mechanism.
We optimize the model using extensive human face video datasets, significantly enhancing its ability to produce high-quality talking head videos.
We develop a demo platform using the Gradio framework, which streamlines the process, enabling users to quickly create customized talking head videos.
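Gradio makes this kind of one-page demo straightforward to sketch. The stub below assumes the platform takes a reference face image and a driving audio clip (an assumption; the summary does not specify the interface), and `generate_talking_head` is a hypothetical placeholder for the refined Image2Video model.

```python
import gradio as gr

def generate_talking_head(reference_image, driving_audio):
    # Placeholder: a real platform would run the animation model here and
    # return the path of the rendered video file.
    raise NotImplementedError("plug the animation model in here")

demo = gr.Interface(
    fn=generate_talking_head,
    inputs=[
        gr.Image(type="filepath", label="Reference face"),
        gr.Audio(type="filepath", label="Driving audio"),
    ],
    outputs=gr.Video(label="Talking head video"),
    title="Talking head demo (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```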
arXiv Detail & Related papers (2024-07-12T03:09:07Z) - Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion
Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z) - VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D
Hybrid Prior [28.737324182301652]
We propose a two-stage generic framework that supports generating high visual quality talking head videos.
In the first stage, we map the audio to mesh by learning two motions, including non-rigid expression motion and rigid head motion.
In the second stage, we propose a dual-branch motion-VAE and a generator to transform the meshes into dense motion and synthesize high-quality video frame by frame.
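Read as code, the two stages have a simple interface: audio features map to mesh motion, and mesh motion maps to rendered frames. Below is a toy PyTorch sketch of that flow; every layer and dimension is an illustrative stand-in for the paper's dual-branch motion-VAE and generator.

```python
import torch
import torch.nn as nn

class AudioToMesh(nn.Module):
    # Stage 1: learn two motions from audio — non-rigid expression offsets
    # plus a rigid 6-DoF head pose.
    def __init__(self, audio_dim=128, n_verts=468):
        super().__init__()
        self.n_verts = n_verts
        self.expr = nn.Linear(audio_dim, n_verts * 3)
        self.pose = nn.Linear(audio_dim, 6)

    def forward(self, audio_feat):
        offsets = self.expr(audio_feat).view(-1, self.n_verts, 3)
        return offsets, self.pose(audio_feat)

class MeshToFrame(nn.Module):
    # Stage 2: stand-in for the dual-branch motion-VAE + generator; maps the
    # driven mesh to a dense 2D motion field, then renders one frame from it.
    def __init__(self, n_verts=468, hw=64):
        super().__init__()
        self.hw = hw
        self.to_motion = nn.Linear(n_verts * 3, hw * hw * 2)
        self.render = nn.Conv2d(2, 3, kernel_size=3, padding=1)

    def forward(self, mesh):
        flow = self.to_motion(mesh.flatten(1)).view(-1, 2, self.hw, self.hw)
        return torch.sigmoid(self.render(flow))

audio_feat = torch.randn(1, 128)
offsets, pose = AudioToMesh()(audio_feat)
frame = MeshToFrame()(offsets)  # (1, 3, 64, 64) synthesized frame
```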
arXiv Detail & Related papers (2023-12-04T12:25:37Z) - Hierarchical Semantic Perceptual Listener Head Video Generation: A
High-performance Pipeline [6.9329709955764045]
This paper is a technical report for the ViCo@2023 Conversational Head Generation Challenge at the ACM Multimedia 2023 conference.
arXiv Detail & Related papers (2023-07-19T08:16:34Z) - Identity-Preserving Talking Face Generation with Landmark and Appearance
Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
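A compact PyTorch sketch of that two-stage interface: audio features are mapped to 2D landmarks, the landmarks are rasterized into a heatmap, and a renderer combines the heatmap with the reference image that carries the appearance prior. All layers and sizes are illustrative assumptions, not the paper's networks.

```python
import torch
import torch.nn as nn

class AudioToLandmarks(nn.Module):
    # Stage 1: predict 68 normalized (x, y) landmarks from an audio feature.
    def __init__(self, audio_dim=80, n_points=68):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(),
            nn.Linear(256, n_points * 2), nn.Sigmoid(),  # coords in [0, 1]
        )

    def forward(self, audio_feat):
        return self.mlp(audio_feat).view(-1, self.n_points, 2)

def landmarks_to_heatmap(lms, hw=64):
    # Rasterize normalized points into a single-channel map.
    heat = torch.zeros(lms.shape[0], 1, hw, hw)
    idx = (lms * (hw - 1)).long()
    for b in range(lms.shape[0]):
        heat[b, 0, idx[b, :, 1], idx[b, :, 0]] = 1.0
    return heat

class LandmarkToFrame(nn.Module):
    # Stage 2: render a frame from the landmark heatmap plus the reference
    # image that supplies the identity/appearance prior.
    def __init__(self):
        super().__init__()
        self.render = nn.Conv2d(3 + 1, 3, kernel_size=3, padding=1)

    def forward(self, reference, heatmap):
        return torch.sigmoid(self.render(torch.cat([reference, heatmap], 1)))

lms = AudioToLandmarks()(torch.randn(1, 80))
frame = LandmarkToFrame()(torch.rand(1, 3, 64, 64), landmarks_to_heatmap(lms))
```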
arXiv Detail & Related papers (2023-05-15T01:31:32Z) - GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face
Synthesis [62.297513028116576]
GeneFace is a general and high-fidelity NeRF-based talking face generation method.
A head-aware torso-NeRF is proposed to eliminate the head-torso separation problem.
arXiv Detail & Related papers (2023-01-31T05:56:06Z) - DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven
Portraits Animation [78.08004432704826]
We model talking head generation as an audio-driven, temporally coherent denoising process (DiffTalk).
In this paper, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis.
Our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost.
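Schematically, one conditional reverse-diffusion step in that spirit looks as follows: the denoiser sees the noisy frame concatenated with reference-face and landmark conditions and predicts the noise to remove. The tiny network and the single-step x0 estimate are toy placeholders; DiffTalk's actual audio conditioning, schedule, and architecture are not specified in this summary.

```python
import torch
import torch.nn as nn

class CondDenoiser(nn.Module):
    # Toy denoiser: conditions (reference RGB + landmark heatmap) are simply
    # concatenated to the noisy frame; the real model would also inject audio.
    def __init__(self, img_ch=3, cond_ch=3 + 1):
        super().__init__()
        self.net = nn.Conv2d(img_ch + cond_ch, img_ch, kernel_size=3, padding=1)

    def forward(self, x_t, cond):
        return self.net(torch.cat([x_t, cond], dim=1))  # predicted noise eps

@torch.no_grad()
def estimate_x0(model, x_t, cond, alpha_bar_t):
    # Standard DDPM identity: x0 = (x_t - sqrt(1 - a_bar) * eps) / sqrt(a_bar).
    eps = model(x_t, cond)
    return (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()

x_t = torch.randn(1, 3, 64, 64)                  # noisy frame at step t
cond = torch.rand(1, 4, 64, 64)                  # reference image + heatmap
x0 = estimate_x0(CondDenoiser(), x_t, cond, torch.tensor(0.5))
```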
arXiv Detail & Related papers (2023-01-10T05:11:25Z) - Audio2Head: Audio-driven One-shot Talking-head Generation with Natural
Head Motion [34.406907667904996]
We propose an audio-driven talking-head method to generate photo-realistic talking-head videos from a single reference image.
We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN).
Then, we develop a motion field generator to produce the dense motion fields from input audio, head poses, and a reference image.
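A minimal sketch of the first component, the motion-aware recurrent head-pose predictor: a GRU consumes per-frame audio features and emits a rigid 6D pose (three rotations, three translations) per frame. The feature size, hidden width, and the choice of GRU are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class HeadPosePredictor(nn.Module):
    def __init__(self, audio_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 6)    # yaw, pitch, roll, tx, ty, tz

    def forward(self, audio_seq):
        # audio_seq: (batch, frames, audio_dim), e.g. per-frame mel features
        h, _ = self.rnn(audio_seq)
        return self.head(h)                 # (batch, frames, 6) rigid poses

poses = HeadPosePredictor()(torch.randn(2, 100, 80))  # 100 frames of audio
```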
arXiv Detail & Related papers (2021-07-20T07:22:42Z) - AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis [55.24336227884039]
We present a novel framework to generate high-fidelity talking head video.
We use neural scene representation networks to bridge the gap between audio input and video output.
Our framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
arXiv Detail & Related papers (2021-03-20T02:58:13Z) - Audio-driven Talking Face Video Generation with Learning-based
Personalized Head Pose [67.31838207805573]
We propose a deep neural network model that takes an audio signal A of a source person and a short video V of a target person as input.
The model outputs a synthesized, high-quality talking face video with a personalized head pose.
Our method can generate high-quality talking face videos with more distinguishing head movement effects than state-of-the-art methods.
arXiv Detail & Related papers (2020-02-24T10:02:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.