Audio-Visual Face Reenactment
- URL: http://arxiv.org/abs/2210.02755v1
- Date: Thu, 6 Oct 2022 08:48:10 GMT
- Title: Audio-Visual Face Reenactment
- Authors: Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar
- Abstract summary: This work proposes a novel method to generate realistic talking head videos using audio and visual streams.
We animate a source image by transferring head motion from a driving video using a dense motion field generated from learnable keypoints.
We improve the quality of lip sync using audio as an additional input, helping the network to attend to the mouth region.
- Score: 34.79242760137663
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This work proposes a novel method to generate realistic talking head videos
using audio and visual streams. We animate a source image by transferring head
motion from a driving video using a dense motion field generated from
learnable keypoints. We improve the quality of lip sync using audio as an
additional input, helping the network to attend to the mouth region. We
incorporate additional priors from face segmentation and a face mesh to improve the
structure of the reconstructed faces. Finally, we improve the visual quality of
the generations by incorporating a carefully designed identity-aware generator
module. The identity-aware generator takes the source image and the warped
motion features as input to generate a high-quality output with fine-grained
details. Our method produces state-of-the-art results and generalizes well to
unseen faces, languages, and voices. We comprehensively evaluate our approach
using multiple metrics and outperform current techniques both qualitatively
and quantitatively. Our work opens up several applications,
including enabling low bandwidth video calls. We release a demo video and
additional information at
http://cvit.iiit.ac.in/research/projects/cvit-projects/avfr.
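To make the pipeline concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of the keypoint-driven dense motion warping described in the abstract: per-keypoint displacements between the source and driving frames are combined into a single dense motion field, which then warps the source image toward the driving pose. All names, shapes, and the Gaussian weighting scheme are illustrative assumptions; the full method additionally uses a learned dense-motion network, the audio stream, face segmentation and mesh priors, and the identity-aware generator.
```python
# Hypothetical sketch of keypoint-driven dense motion warping (illustrative only).
import torch
import torch.nn.functional as F


def dense_motion_from_keypoints(kp_source, kp_driving, height, width, sigma=0.1):
    """Combine per-keypoint translations into a single dense motion field.

    kp_source, kp_driving: (B, K, 2) keypoint locations in [-1, 1] as (x, y).
    Returns a sampling grid of shape (B, H, W, 2) usable with F.grid_sample.
    """
    b, k, _ = kp_source.shape
    device = kp_source.device
    # Identity coordinate grid in [-1, 1], ordered as (x, y).
    ys = torch.linspace(-1, 1, height, device=device)
    xs = torch.linspace(-1, 1, width, device=device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    identity = torch.stack([grid_x, grid_y], dim=-1)           # (H, W, 2)
    identity = identity.unsqueeze(0).unsqueeze(1)               # (1, 1, H, W, 2)

    # Per-keypoint translation mapping driving coordinates back to source coordinates.
    shift = (kp_source - kp_driving).view(b, k, 1, 1, 2)        # (B, K, 1, 1, 2)
    candidate = identity + shift                                 # (B, K, H, W, 2)

    # Soft-assign each pixel to nearby driving keypoints with Gaussian weights.
    dist = ((identity - kp_driving.view(b, k, 1, 1, 2)) ** 2).sum(-1)   # (B, K, H, W)
    weights = torch.softmax(-dist / (2 * sigma ** 2), dim=1).unsqueeze(-1)

    return (weights * candidate).sum(dim=1)                      # (B, H, W, 2)


def warp_source(source_image, kp_source, kp_driving):
    """Warp the source image toward the driving pose via the dense motion field."""
    _, _, h, w = source_image.shape
    grid = dense_motion_from_keypoints(kp_source, kp_driving, h, w)
    return F.grid_sample(source_image, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)


if __name__ == "__main__":
    src = torch.rand(1, 3, 256, 256)            # source frame
    kp_s = torch.rand(1, 10, 2) * 2 - 1         # 10 keypoints in the source
    kp_d = kp_s + 0.05 * torch.randn(1, 10, 2)  # slightly displaced driving keypoints
    warped = warp_source(src, kp_s, kp_d)
    print(warped.shape)                          # torch.Size([1, 3, 256, 256])
```
In the actual system, the warped motion features (rather than the raw image alone) would be passed, together with the source image, to the identity-aware generator to synthesize the final high-quality frame.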
Related papers
- JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation [24.2065254076207]
We introduce a novel method for joint expression and audio-guided talking face generation.
Our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer.
arXiv Detail & Related papers (2024-09-18T17:18:13Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the transfer of these techniques from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- HyperLips: Hyper Control Lips with High Resolution Decoder for Talking Face Generation [21.55822398346139]
HyperLips is a two-stage framework consisting of a hypernetwork for controlling lips and a high-resolution decoder for rendering high-fidelity faces.
In the first stage, we construct a base face generation network that uses the hypernetwork to control the encoding latent code of the visual face information over audio.
In the second stage, we obtain higher quality face videos through a high-resolution decoder.
arXiv Detail & Related papers (2023-10-09T13:45:21Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures (see the sketch after this list).
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers [91.00397473678088]
Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions.
We propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality.
Our model can generate high-fidelity lip-synced results for arbitrary subjects.
arXiv Detail & Related papers (2022-12-09T16:32:46Z)
- One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning [20.51814865676907]
It would be much easier to learn a consistent speech style from a specific speaker, which leads to authentic mouth movements.
We propose a novel one-shot talking face generation framework by exploring consistent correlations between audio and visual motions from a specific speaker.
Thanks to our learned consistent speaking style, our method generates authentic mouth shapes and vivid movements.
arXiv Detail & Related papers (2021-12-06T02:53:51Z)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
- MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Our approach ensures highly accurate lip motion while also producing plausible animation of the parts of the face that are uncorrelated with the audio signal, such as eye blinks and eyebrow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z)
- Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z)
- Robust One Shot Audio to Video Generation [10.957973845883162]
OneShotA2V is a novel approach that synthesizes a talking-person video of arbitrary length using only an audio signal and a single unseen image of the person as input.
OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person.
arXiv Detail & Related papers (2020-12-14T10:50:05Z)
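Several of the entries above, notably Identity-Preserving Talking Face Generation, describe a two-stage pipeline: audio-to-landmark prediction followed by landmark-to-video rendering. The sketch below is a hypothetical, minimal interface for such a pipeline; the module internals are placeholder MLPs and all names and shapes are illustrative assumptions, not the published architectures.
```python
# Hypothetical two-stage interface: audio -> landmarks -> video frames (illustrative only).
import torch
import torch.nn as nn


class AudioToLandmarks(nn.Module):
    """Stage 1: predict (x, y) coordinates of N facial landmarks per audio frame."""

    def __init__(self, audio_dim=80, num_landmarks=68, hidden=256):
        super().__init__()
        self.num_landmarks = num_landmarks
        self.net = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_landmarks * 2),
        )

    def forward(self, audio_feats):                  # (B, T, audio_dim)
        out = self.net(audio_feats)                   # (B, T, N*2)
        return out.view(*audio_feats.shape[:2], self.num_landmarks, 2)


class LandmarksToVideo(nn.Module):
    """Stage 2: render frames from predicted landmarks plus an identity reference image."""

    def __init__(self, num_landmarks=68, image_size=96):
        super().__init__()
        self.image_size = image_size
        in_dim = num_landmarks * 2 + 3 * image_size * image_size
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * image_size * image_size), nn.Sigmoid(),
        )

    def forward(self, landmarks, reference):          # (B, T, N, 2), (B, 3, H, W)
        b, t = landmarks.shape[:2]
        ref = reference.flatten(1).unsqueeze(1).expand(b, t, -1)
        x = torch.cat([landmarks.flatten(2), ref], dim=-1)
        frames = self.net(x)
        return frames.view(b, t, 3, self.image_size, self.image_size)


if __name__ == "__main__":
    audio = torch.randn(1, 25, 80)                    # 25 frames of mel-spectrogram features
    ref = torch.rand(1, 3, 96, 96)                    # identity reference image
    landmarks = AudioToLandmarks()(audio)
    video = LandmarksToVideo()(landmarks, ref)
    print(landmarks.shape, video.shape)               # (1, 25, 68, 2) (1, 25, 3, 96, 96)
```
The two stages are trained and evaluated separately in such pipelines, which keeps identity appearance (stage 2) decoupled from audio-driven motion (stage 1).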
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.