One Shot Audio to Animated Video Generation
- URL: http://arxiv.org/abs/2102.09737v1
- Date: Fri, 19 Feb 2021 04:29:17 GMT
- Title: One Shot Audio to Animated Video Generation
- Authors: Neeraj Kumar, Srishti Goel, Ankur Narang, Brejesh Lall, Mujtaba Hasan,
Pranshu Agarwal, Dipankar Sarkar
- Abstract summary: We propose a novel method to generate an animated video of arbitrary length using an audio clip and a single unseen image of a person as an input.
OneShotAu2AV can generate animated videos that have: (a) lip movements that are in sync with the audio, (b) natural facial expressions such as blinks and eyebrow movements, and (c) head movements.
- Score: 15.148595295859659
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We consider the challenging problem of audio to animated video generation. We propose a novel method, OneShotAu2AV, to generate an animated video of arbitrary length using an audio clip and a single unseen image of a person as input. The proposed method consists of two stages. In the first stage, OneShotAu2AV generates a talking-head video in the human domain given an audio clip and a person's image. In the second stage, the talking-head video is converted from the human domain to the animated domain. The first-stage architecture consists of a spatially adaptive normalization based multi-level generator and multiple multi-level discriminators, trained with several adversarial and non-adversarial losses. The second stage leverages an attention-based normalization driven GAN architecture together with a temporal-predictor-based recycle loss and a blink loss, coupled with a lip-sync loss, for unsupervised generation of the animated video. The input audio clip is not restricted to any specific language, which gives the method multilingual applicability. OneShotAu2AV can generate animated videos that have: (a) lip movements that are in sync with the audio, (b) natural facial expressions such as blinks and eyebrow movements, and (c) head movements. Experimental evaluation demonstrates superior performance of OneShotAu2AV compared to U-GAT-IT and RecycleGan on multiple quantitative metrics, including KID (Kernel Inception Distance), word error rate, and blinks/sec.
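The abstract lists the second-stage training signals (a temporal-predictor-based recycle loss, a blink loss, a lip-sync loss, and adversarial terms) but not how they are combined. Below is a minimal, hypothetical PyTorch-style sketch of one way such terms could be folded into a single weighted objective; the class name, loss weights, and tensor conventions are illustrative assumptions, not the authors' implementation.

```python
import torch.nn as nn

# Hypothetical sketch: combining the second-stage losses named in the abstract
# (recycle loss via a temporal predictor, blink loss, lip-sync loss, and an
# adversarial term) into one weighted objective. Module names, weights, and
# shapes are assumptions for illustration only.

class SecondStageObjective(nn.Module):
    def __init__(self, w_adv=1.0, w_recycle=10.0, w_blink=1.0, w_lipsync=1.0):
        super().__init__()
        self.w_adv, self.w_recycle = w_adv, w_recycle
        self.w_blink, self.w_lipsync = w_blink, w_lipsync
        self.l1 = nn.L1Loss()

    def forward(self, d_fake_logits, recycled_frames, real_frames,
                blink_pred, blink_target, lipsync_score):
        # Non-saturating adversarial loss on the discriminator's logits for
        # generated (animated-domain) frames.
        adv = nn.functional.softplus(-d_fake_logits).mean()
        # Recycle loss: frames mapped human -> animated -> (temporal predictor)
        # -> human should reconstruct the original frames.
        recycle = self.l1(recycled_frames, real_frames)
        # Blink loss: match a predicted eye-aperture signal to a reference signal.
        blink = self.l1(blink_pred, blink_target)
        # Lip-sync loss: e.g. 1 - cosine similarity between audio and mouth
        # embeddings from a pretrained sync network (assumed available).
        lipsync = (1.0 - lipsync_score).mean()
        return (self.w_adv * adv + self.w_recycle * recycle
                + self.w_blink * blink + self.w_lipsync * lipsync)
```

The paper's full text defines the exact form and weighting of each loss; the sketch only shows the overall composition.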
Related papers
- UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation [53.16986875759286]
We present a UniAnimate framework to enable efficient and long-term human video generation.
We map the reference image along with the posture guidance and noise video into a common feature space.
We also propose a unified noise input that supports random noised input as well as first frame conditioned input.
arXiv Detail & Related papers (2024-06-03T10:51:10Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization [38.64540967776744]
Diff2Lip is an audio-conditioned, diffusion-based model that performs in-the-wild lip synchronization while preserving the visual qualities of the source video.
We show results in both reconstruction (same audio-video inputs) and cross (different audio-video inputs) settings on the Voxceleb2 and LRW datasets.
arXiv Detail & Related papers (2023-08-18T17:59:40Z)
- Learning to Dub Movies via Hierarchical Prosody Models [167.6465354313349]
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference.
We propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modelling, which bridges the visual information to corresponding speech prosody from three aspects: lip, face, and scene.
arXiv Detail & Related papers (2022-12-08T03:29:04Z)
- Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation [47.06075725469252]
StyleTalker is an audio-driven talking head generation model.
It can synthesize a video of a talking person from a single reference image.
Our model is able to synthesize talking head videos with impressive perceptual quality.
arXiv Detail & Related papers (2022-08-23T12:49:01Z)
- FaceFormer: Speech-Driven 3D Facial Animation with Transformers [46.8780140220063]
Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data.
We propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes; a toy sketch of this style of autoregressive decoding appears after this list.
arXiv Detail & Related papers (2021-12-10T04:21:59Z)
- Robust One Shot Audio to Video Generation [10.957973845883162]
OneShotA2V is a novel approach to synthesize a talking-person video of arbitrary length using an audio signal and a single unseen image of a person as input.
OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person.
arXiv Detail & Related papers (2020-12-14T10:50:05Z)
- Multi Modal Adaptive Normalization for Audio to Video Generation [18.812696623555855]
We propose a multi-modal adaptive normalization (MAN) based architecture to synthesize a talking-person video of arbitrary length using an audio signal and a single image of a person as input.
The architecture uses multi-modal adaptive normalization, a keypoint heatmap predictor, an optical flow predictor, and class activation map [58] based layers to learn movements of expressive facial components.
arXiv Detail & Related papers (2020-12-14T07:39:45Z)
- Everybody's Talkin': Let Me Talk as You Want [134.65914135774605]
We present a method to edit a target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video.
It does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output.
arXiv Detail & Related papers (2020-01-15T09:54:23Z)
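The FaceFormer entry above describes a Transformer that encodes long-term audio context and autoregressively predicts a sequence of 3D face meshes. As referenced there, the following is a toy, self-contained sketch of that general idea (audio features as cross-attention memory, causal decoding of per-frame vertex offsets); it is not FaceFormer's actual architecture, and every module name, dimension, and the greedy rollout below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy sketch of autoregressive speech-driven face animation (the general idea
# behind models such as FaceFormer); NOT the paper's architecture. All module
# names, feature sizes, and the audio front end are hypothetical.

class AudioToMeshDecoder(nn.Module):
    def __init__(self, audio_dim=80, d_model=256, n_vertices=5023, n_layers=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)        # audio frames -> memory
        self.motion_proj = nn.Linear(n_vertices * 3, d_model)  # past mesh offsets -> tokens
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_vertices * 3)         # per-frame vertex offsets

    def forward(self, audio_feats, past_offsets):
        # audio_feats:  (B, T_audio, audio_dim), e.g. mel-spectrogram frames
        # past_offsets: (B, T_past, n_vertices * 3), previously generated frames
        memory = self.audio_proj(audio_feats)
        tgt = self.motion_proj(past_offsets)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(tgt.device)
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.head(hidden)  # offsets to be added to a template mesh

def generate(model, audio_feats, n_frames, n_vertices=5023):
    # Greedy autoregressive rollout: feed each predicted frame back as input.
    B = audio_feats.size(0)
    offsets = torch.zeros(B, 1, n_vertices * 3)  # start from the neutral template
    for _ in range(n_frames):
        next_offset = model(audio_feats, offsets)[:, -1:, :]
        offsets = torch.cat([offsets, next_offset], dim=1)
    return offsets[:, 1:, :]
```

Real systems of this kind typically also condition on a speaker or style embedding and use a pretrained speech encoder for the audio features; the sketch omits those details.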