MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions
- URL: http://arxiv.org/abs/2307.10008v1
- Date: Wed, 19 Jul 2023 14:45:11 GMT
- Title: MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions
- Authors: Yunfei Liu, Lijian Lin, Fei Yu, Changyin Zhou, Yu Li
- Abstract summary: We propose a unified system for multi-person, diverse, and high-fidelity talking portrait generation.
Our method contains three stages: 1) a Mapping-Once network with Dual Attentions (MODA) generates a talking representation from the given audio.
The proposed system produces more natural and realistic video portraits compared to previous methods.
- Score: 15.626317162430087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-driven portrait animation aims to synthesize portrait videos that are
conditioned by given audio. Animating high-fidelity and multimodal video
portraits has a variety of applications. Previous methods have attempted to
capture different motion modes and generate high-fidelity portrait videos by
training different models or sampling signals from given videos. However,
the lack of correlation learning between lip-sync and other movements (e.g., head
pose/eye blinking) usually leads to unnatural results. In this paper, we
propose a unified system for multi-person, diverse, and high-fidelity talking
portrait generation. Our method contains three stages: 1) a Mapping-Once network
with Dual Attentions (MODA) generates a talking representation from the given
audio. In MODA, we design a dual-attention module to encode accurate mouth
movements and diverse modalities. 2) A facial composer network generates dense
and detailed face landmarks, and 3) a temporal-guided renderer synthesizes stable
videos. Extensive evaluations demonstrate that the proposed system produces
more natural and realistic video portraits compared to previous methods.
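
For readers who prefer code, the following is a minimal PyTorch-style sketch of how stages 1 and 2 of the pipeline described above could be wired together, assuming a mel-spectrogram audio input, a 68-point sparse and a 478-point dense landmark layout, and two attention branches (audio-synchronized versus diverse motion). All class names, dimensions, and design choices here are illustrative assumptions, not the paper's actual implementation; the temporal-guided renderer (stage 3) is only indicated in the docstring.

```python
import torch
import torch.nn as nn


class DualAttention(nn.Module):
    """Two attention branches: one for audio-synchronized mouth motion, one for
    diverse, loosely audio-correlated motion (head pose, eye blinks), loosely
    following the abstract's description of the dual-attention module."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.sync_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.diverse_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, audio_tokens, subject_tokens):
        # audio_tokens:   (B, T, dim) per-frame audio features
        # subject_tokens: (B, S, dim) subject-specific tokens (identity / motion style)
        sync, _ = self.sync_attn(audio_tokens, audio_tokens, audio_tokens)
        diverse, _ = self.diverse_attn(audio_tokens, subject_tokens, subject_tokens)
        return self.fuse(torch.cat([sync, diverse], dim=-1))


class MappingOncePipeline(nn.Module):
    """Stage 1: audio -> unified talking representation (here, sparse landmarks).
    Stage 2: a facial composer upgrades the sparse output to dense landmarks.
    Stage 3 (not sketched): a temporally guided renderer turns rasterized dense
    landmarks plus reference frames into stable video."""

    def __init__(self, dim: int = 256, n_sparse: int = 68, n_dense: int = 478):
        super().__init__()
        self.audio_encoder = nn.GRU(input_size=80, hidden_size=dim, batch_first=True)
        self.dual_attention = DualAttention(dim)
        self.to_sparse = nn.Linear(dim, n_sparse * 2)  # per-frame 2D landmarks
        self.facial_composer = nn.Sequential(
            nn.Linear(n_sparse * 2, 512), nn.ReLU(), nn.Linear(512, n_dense * 2)
        )

    def forward(self, mel, subject_tokens):
        feat, _ = self.audio_encoder(mel)                # mel: (B, T, 80) spectrogram
        rep = self.dual_attention(feat, subject_tokens)  # talking representation
        sparse = self.to_sparse(rep)                     # (B, T, n_sparse * 2)
        dense = self.facial_composer(sparse)             # (B, T, n_dense * 2)
        return sparse, dense


# Example: 100 audio frames, 8 subject tokens.
model = MappingOncePipeline()
sparse, dense = model(torch.randn(1, 100, 80), torch.randn(1, 8, 256))
```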
Related papers
- One-Shot Pose-Driving Face Animation Platform [7.422568903818486]
We refine an existing Image2Video model by integrating a Face Locator and Motion Frame mechanism.
We optimize the model using extensive human face video datasets, significantly enhancing its ability to produce high-quality talking head videos.
We develop a demo platform using the Gradio framework, which streamlines the process, enabling users to quickly create customized talking head videos.
arXiv Detail & Related papers (2024-07-12T03:09:07Z)
- AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation [4.568539181254851]
We propose AniPortrait, a framework for generating high-quality animation driven by audio and a reference portrait image.
Experimental results demonstrate the superiority of AniPortrait in terms of facial naturalness, pose diversity, and visual quality.
Our methodology exhibits considerable potential in terms of flexibility and controllability, which can be effectively applied in areas such as facial motion editing or face reenactment.
arXiv Detail & Related papers (2024-03-26T13:35:02Z)
- Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis [88.17520303867099]
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio.
We present Real3D-Portrait, a framework that improves the one-shot 3D reconstruction power with a large image-to-plane model.
Experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos.
arXiv Detail & Related papers (2024-01-16T17:04:30Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- ReliTalk: Relightable Talking Portrait Generation from a Single Video [62.47116237654984]
ReliTalk is a novel framework for relightable audio-driven talking portrait generation from monocular videos.
Our key insight is to decompose the portrait's reflectance from implicitly learned audio-driven facial normals and images.
arXiv Detail & Related papers (2023-09-05T17:59:42Z)
- VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild [37.93856291026653]
VideoReTalking is a new system to edit the faces of a real-world talking head video according to input audio.
It produces a high-quality, lip-synced output video even with a different emotion.
arXiv Detail & Related papers (2022-11-27T08:14:23Z)
- Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation [61.8546794105462]
We propose Semantic-aware Speaking Portrait NeRF (SSP-NeRF), which creates delicate audio-driven portraits using one unified set of NeRF.
We first propose a Semantic-Aware Dynamic Ray Sampling module with an additional parsing branch that facilitates audio-driven volume rendering.
To enable portrait rendering in one unified neural radiance field, a Torso Deformation module is designed to stabilize the large-scale non-rigid torso motions.
arXiv Detail & Related papers (2022-01-19T18:54:41Z)
- Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z)
- Robust One Shot Audio to Video Generation [10.957973845883162]
OneShotA2V is a novel approach that synthesizes a talking-person video of arbitrary length from an audio signal and a single unseen image of a person.
OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person.
arXiv Detail & Related papers (2020-12-14T10:50:05Z)
- Multi Modal Adaptive Normalization for Audio to Video Generation [18.812696623555855]
We propose a multi-modal adaptive normalization (MAN) based architecture that synthesizes a talking-person video of arbitrary length from an audio signal and a single image of a person.
The architecture uses multi-modal adaptive normalization, a keypoint heatmap predictor, an optical flow predictor, and class activation map [58] based layers to learn the movements of expressive facial components (a generic sketch of the adaptive-normalization idea follows this entry).
arXiv Detail & Related papers (2020-12-14T07:39:45Z)
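
As a reading aid for the entry above, the following is a minimal sketch of the general feature-wise adaptive-normalization idea (AdaIN/SPADE-style modulation conditioned on fused audio and image embeddings). The class name, fusion scheme, and dimensions are illustrative assumptions and not the MAN paper's actual design.

```python
import torch
import torch.nn as nn


class MultiModalAdaptiveNorm(nn.Module):
    """Normalizes a generator feature map, then modulates it with a per-channel
    scale and shift predicted from fused audio and image embeddings."""

    def __init__(self, num_channels: int, audio_dim: int, image_dim: int):
        super().__init__()
        # Affine parameters come from the conditioning signals, not the norm itself.
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.to_gamma_beta = nn.Sequential(
            nn.Linear(audio_dim + image_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 2 * num_channels),
        )

    def forward(self, feat, audio_feat, image_feat):
        # feat:       (B, C, H, W) generator feature map
        # audio_feat: (B, audio_dim) pooled audio embedding
        # image_feat: (B, image_dim) pooled identity-image embedding
        cond = torch.cat([audio_feat, image_feat], dim=1)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(feat) + beta
```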
- Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose [67.31838207805573]
We propose a deep neural network model that takes an audio signal A of a source person and a short video V of a target person as input.
It outputs a synthesized high-quality talking face video with a personalized head pose.
Our method can generate high-quality talking face videos with more distinguishing head movement effects than state-of-the-art methods.
arXiv Detail & Related papers (2020-02-24T10:02:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.