RADIO: Reference-Agnostic Dubbing Video Synthesis
- URL: http://arxiv.org/abs/2309.01950v2
- Date: Mon, 6 Nov 2023 05:06:54 GMT
- Title: RADIO: Reference-Agnostic Dubbing Video Synthesis
- Authors: Dongyeun Lee, Chaewon Kim, Sangjoon Yu, Jaejun Yoo, Gyeong-Moon Park
- Abstract summary: Given only a single reference image, extracting meaningful identity attributes becomes even more challenging.
We introduce RADIO, a framework engineered to yield high-quality dubbed videos regardless of the pose or expression in reference images.
Our experimental results demonstrate that RADIO displays high synchronization without the loss of fidelity.
- Score: 12.872464331012544
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: One of the most challenging problems in audio-driven talking head generation
is achieving high-fidelity detail while ensuring precise synchronization. Given
only a single reference image, extracting meaningful identity attributes
becomes even more challenging, often causing the network to mirror the facial
and lip structures too closely. To address these issues, we introduce RADIO, a
framework engineered to yield high-quality dubbed videos regardless of the pose
or expression in reference images. The key is to modulate the decoder layers
using latent space composed of audio and reference features. Additionally, we
incorporate ViT blocks into the decoder to emphasize high-fidelity details,
especially in the lip region. Our experimental results demonstrate that RADIO
displays high synchronization without the loss of fidelity. Especially in harsh
scenarios where the reference frame deviates significantly from the ground
truth, our method outperforms state-of-the-art methods, highlighting its
robustness.
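The abstract describes the architecture only in words; below is a minimal PyTorch-style sketch of the stated idea, i.e. decoder layers modulated by a latent vector composed of audio and reference features, followed by a small ViT-style block intended to emphasize fine detail (e.g. in the lip region). Module names, dimensions, and the affine channel-wise modulation scheme are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the RADIO idea: decoder features modulated by a latent
# built from audio + reference embeddings, plus a ViT-style refinement block.
# Names, dimensions, and the affine-modulation choice are assumptions.
import torch
import torch.nn as nn


class ModulatedDecoderLayer(nn.Module):
    def __init__(self, channels: int, latent_dim: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        # Map the fused latent to per-channel scale and shift.
        self.to_scale_shift = nn.Linear(latent_dim, 2 * channels)

    def forward(self, feat: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(latent).chunk(2, dim=1)
        feat = self.conv(feat)
        # Broadcast (B, C) -> (B, C, 1, 1) and modulate channel-wise.
        return feat * (1 + scale[..., None, None]) + shift[..., None, None]


class TinyViTBlock(nn.Module):
    """One pre-norm transformer block over spatial tokens (assumed detail-refinement role)."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(channels, 4 * channels), nn.GELU(),
                                 nn.Linear(4 * channels, channels))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)            # (B, H*W, C)
        x = self.norm1(tokens)
        tokens = tokens + self.attn(x, x, x, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    audio_emb = torch.randn(2, 256)       # from an audio encoder (assumed)
    ref_emb = torch.randn(2, 256)         # from a reference-image encoder (assumed)
    latent = torch.cat([audio_emb, ref_emb], dim=1)

    feat = torch.randn(2, 64, 32, 32)     # decoder feature map
    feat = ModulatedDecoderLayer(64, 512)(feat, latent)
    feat = TinyViTBlock(64)(feat)
    print(feat.shape)                      # torch.Size([2, 64, 32, 32])
```

How the audio and reference features are fused, which decoder layers are modulated, and where the ViT blocks sit are design choices of the paper; the sketch fixes only one plausible combination.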
Related papers
- MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting [12.852715177163608]
MuseTalk generates lip-sync targets in a latent space encoded by a Variational Autoencoder.
It supports online generation of 256x256 face images at more than 30 FPS with negligible starting latency.
arXiv Detail & Related papers (2024-10-14T03:22:26Z)
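The MuseTalk summary above mentions lip-sync targets generated by inpainting in a VAE latent space; the snippet below is a minimal, assumed sketch of that idea: a VAE encodes the frame, a generator fills the masked (mouth) portion of the latent conditioned on audio features, and the VAE decoder reconstructs the frame. The `TinyVAE`, `LatentInpainter`, mask layout, and dimensions are hypothetical placeholders, not MuseTalk's actual models.

```python
# Assumed sketch of VAE latent-space lip-region inpainting (not MuseTalk's code).
import torch
import torch.nn as nn


class TinyVAE(nn.Module):
    """Stand-in encoder/decoder pair working at 1/8 spatial resolution."""
    def __init__(self, latent_ch: int = 4):
        super().__init__()
        self.enc = nn.Conv2d(3, latent_ch, kernel_size=8, stride=8)
        self.dec = nn.ConvTranspose2d(latent_ch, 3, kernel_size=8, stride=8)

    def encode(self, x): return self.enc(x)
    def decode(self, z): return self.dec(z)


class LatentInpainter(nn.Module):
    """Predicts latent content for the masked (mouth) region from audio features."""
    def __init__(self, latent_ch: int = 4, audio_dim: int = 128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, latent_ch)
        self.refine = nn.Conv2d(latent_ch, latent_ch, 3, padding=1)

    def forward(self, z_masked, audio):
        bias = self.audio_proj(audio)[..., None, None]    # (B, C, 1, 1)
        return self.refine(z_masked + bias)


if __name__ == "__main__":
    vae, inpainter = TinyVAE(), LatentInpainter()
    frame = torch.randn(1, 3, 256, 256)                   # 256x256 input frame
    audio = torch.randn(1, 128)                           # per-frame audio feature

    z = vae.encode(frame)                                 # (1, 4, 32, 32)
    mask = torch.zeros_like(z)
    mask[..., 16:, :] = 1.0                               # lower half ~ mouth region
    z_filled = inpainter(z * (1 - mask), audio)           # fill masked latent
    out = vae.decode(z * (1 - mask) + z_filled * mask)    # keep unmasked content
    print(out.shape)                                       # torch.Size([1, 3, 256, 256])
```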
- LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details [14.22392871407274]
We present an effective post-processing approach to synthesize photo-realistic talking head videos.
Specifically, we employ a pretrained Wav2Lip model as our foundation model, leveraging its robust audio-lip alignment capabilities.
Results indicate that our method, LaDTalk, achieves new state-of-the-art video quality and out-of-domain lip synchronization performance.
arXiv Detail & Related papers (2024-10-01T18:32:02Z)
- SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space [13.59798532129008]
We propose an innovative unified framework, SwapTalk, which accomplishes both face swapping and lip synchronization tasks in the same latent space.
We introduce a novel identity consistency metric to more comprehensively assess the identity consistency over time series in generated facial videos.
Experimental results on the HDTF demonstrate that our method significantly surpasses existing techniques in video quality, lip synchronization accuracy, face swapping fidelity, and identity consistency.
arXiv Detail & Related papers (2024-05-09T09:22:09Z)
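SwapTalk's identity consistency metric is not defined in the summary above; as a rough illustration of measuring identity consistency over a time series of generated frames, the sketch below averages pairwise cosine similarity between per-frame identity embeddings. The embedding source and the pairwise-similarity formulation are assumptions, not the paper's definition.

```python
# Illustrative (assumed) identity-consistency score over a generated clip:
# mean cosine similarity between identity embeddings of all frame pairs.
import torch
import torch.nn.functional as F


def identity_consistency(frame_embeddings: torch.Tensor) -> float:
    """frame_embeddings: (T, D) identity vectors, one per generated frame."""
    emb = F.normalize(frame_embeddings, dim=1)            # unit-norm rows
    sim = emb @ emb.t()                                    # (T, T) cosine similarities
    t = sim.size(0)
    off_diag = sim.sum() - sim.diagonal().sum()            # exclude self-similarity
    return (off_diag / (t * (t - 1))).item()


if __name__ == "__main__":
    # Stand-in for embeddings from a face-recognition network (hypothetical).
    embs = torch.randn(16, 512)
    print(f"identity consistency: {identity_consistency(embs):.3f}")
```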
- DiffDub: Person-generic Visual Dubbing Using Inpainting Renderer with Diffusion Auto-encoder [21.405442790474268]
We propose DiffDub: Diffusion-based dubbing.
We first craft the diffusion auto-encoder with an inpainting renderer that incorporates a mask to delineate editable zones and unaltered regions.
To tackle remaining challenges, we employ versatile strategies, including data augmentation and supplementary eye guidance.
arXiv Detail & Related papers (2023-11-03T09:41:51Z)
- ReliTalk: Relightable Talking Portrait Generation from a Single Video [62.47116237654984]
ReliTalk is a novel framework for relightable audio-driven talking portrait generation from monocular videos.
Our key insight is to decompose the portrait's reflectance from implicitly learned audio-driven facial normals and images.
arXiv Detail & Related papers (2023-09-05T17:59:42Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
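The two-stage split described in the entry above (audio-to-landmark generation, then landmark-to-video rendering) is a common pipeline shape; below is a minimal, assumed interface sketch showing how the two stages compose. Both stages are toy placeholder modules, not the paper's networks, and the landmark count and resolutions are arbitrary.

```python
# Assumed interface sketch of a two-stage pipeline: audio -> landmarks -> video.
import torch
import torch.nn as nn


class AudioToLandmarks(nn.Module):
    """Maps per-frame audio features to 2D facial landmarks (assumed 68 points)."""
    def __init__(self, audio_dim: int = 128, n_points: int = 68):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_points * 2))
        self.n_points = n_points

    def forward(self, audio):                              # (B, T, audio_dim)
        b, t, _ = audio.shape
        return self.net(audio).view(b, t, self.n_points, 2)


class LandmarksToVideo(nn.Module):
    """Renders frames from landmarks plus a reference image (toy renderer)."""
    def __init__(self, n_points: int = 68):
        super().__init__()
        self.to_map = nn.Linear(n_points * 2, 64 * 64)
        self.render = nn.Conv2d(4, 3, 3, padding=1)        # landmark map + RGB reference

    def forward(self, landmarks, reference):               # (B, T, P, 2), (B, 3, 64, 64)
        b, t = landmarks.shape[:2]
        lm_map = self.to_map(landmarks.flatten(2)).view(b, t, 1, 64, 64)
        ref = reference.unsqueeze(1).expand(b, t, 3, 64, 64)
        x = torch.cat([lm_map, ref], dim=2)                # (B, T, 4, 64, 64)
        return self.render(x.flatten(0, 1)).view(b, t, 3, 64, 64)


if __name__ == "__main__":
    audio = torch.randn(2, 10, 128)                        # 10 frames of audio features
    ref = torch.randn(2, 3, 64, 64)                        # single reference image
    landmarks = AudioToLandmarks()(audio)                  # stage 1
    video = LandmarksToVideo()(landmarks, ref)             # stage 2
    print(landmarks.shape, video.shape)                    # (2,10,68,2) (2,10,3,64,64)
```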
- GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis [62.297513028116576]
GeneFace is a general and high-fidelity NeRF-based talking face generation method.
A head-aware torso-NeRF is proposed to eliminate the head-torso separation problem.
arXiv Detail & Related papers (2023-01-31T05:56:06Z)
- DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation [78.08004432704826]
We model talking head generation as an audio-driven, temporally coherent denoising process (DiffTalk).
In this paper, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis.
Our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost.
arXiv Detail & Related papers (2023-01-10T05:11:25Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
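A hedged sketch of the two-component layout named in the SVTS summary above: a predictor maps silent lip-video frames to a mel-spectrogram, which a separately trained vocoder turns into a waveform. Both modules here are toy placeholders chosen to show the interface, not the actual SVTS or vocoder models.

```python
# Assumed interface sketch of a video-to-speech pipeline:
# silent lip video -> mel-spectrogram -> waveform via a placeholder vocoder.
import torch
import torch.nn as nn


class VideoToSpectrogram(nn.Module):
    """Toy predictor: per-frame visual features -> mel-spectrogram frames."""
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.visual = nn.Conv3d(3, 32, kernel_size=(1, 8, 8), stride=(1, 8, 8))
        self.head = nn.Linear(32 * 12 * 12, n_mels)

    def forward(self, video):                              # (B, 3, T, 96, 96)
        feat = self.visual(video)                          # (B, 32, T, 12, 12)
        feat = feat.permute(0, 2, 1, 3, 4).flatten(2)      # (B, T, 32*12*12)
        return self.head(feat)                             # (B, T, n_mels)


class PlaceholderVocoder(nn.Module):
    """Stub for a pre-trained neural vocoder: mel frames -> waveform samples."""
    def __init__(self, n_mels: int = 80, hop: int = 160):
        super().__init__()
        self.upsample = nn.Linear(n_mels, hop)

    def forward(self, mel):                                # (B, T, n_mels)
        return self.upsample(mel).flatten(1)               # (B, T * hop) samples


if __name__ == "__main__":
    lip_video = torch.randn(1, 3, 25, 96, 96)              # 1 s of 25 fps mouth crops
    mel = VideoToSpectrogram()(lip_video)                  # (1, 25, 80)
    wav = PlaceholderVocoder()(mel)                        # (1, 4000) toy waveform
    print(mel.shape, wav.shape)
```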
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)