Robust One Shot Audio to Video Generation
- URL: http://arxiv.org/abs/2012.07842v1
- Date: Mon, 14 Dec 2020 10:50:05 GMT
- Title: Robust One Shot Audio to Video Generation
- Authors: Neeraj Kumar, Srishti Goel, Ankur Narang, Mujtaba Hasan
- Abstract summary: OneShotA2V is a novel approach to synthesize a talking person video of arbitrary length using as input: an audio signal and a single unseen image of a person.
OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person.
- Score: 10.957973845883162
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Audio to Video generation is an interesting problem that has numerous
applications across industry verticals, including filmmaking, multimedia,
marketing, education, and others. High-quality video generation with expressive
facial movements is a challenging problem that involves complex learning steps
for generative adversarial networks. Further, enabling one-shot learning for an
unseen single image increases the complexity of the problem while
simultaneously making it more applicable to practical scenarios. In this paper,
we propose a novel approach OneShotA2V to synthesize a talking person video of
arbitrary length using as input: an audio signal and a single unseen image of a
person. OneShotA2V leverages curriculum learning to learn movements of
expressive facial components and hence generates a high-quality talking-head
video of the given person. Further, it feeds the features generated from the
audio input directly into a generative adversarial network, and it adapts to any
given unseen selfie by applying few-shot learning with only a few update epochs.
OneShotA2V uses a spatially adaptive normalization based multi-level generator
together with multiple multi-level discriminators. The input audio clip is not
restricted to any specific language,
which gives the method multilingual applicability. Experimental evaluation
demonstrates superior performance of OneShotA2V as compared to Realistic
Speech-Driven Facial Animation with GANs (RSDGAN) [43], Speech2Vid [8], and
other approaches on multiple quantitative metrics, including SSIM (structural
similarity index), PSNR (peak signal-to-noise ratio), and CPBD (cumulative
probability of blur detection, a measure of image sharpness). Further,
qualitative evaluation and online Turing tests demonstrate
the efficacy of our approach.
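To make the architecture description above concrete, below is a minimal PyTorch sketch (not the authors' released code) of the two ideas the abstract highlights: a spatially adaptive normalization block whose per-pixel modulation parameters are predicted from the audio features, and a few-shot adaptation pass that fine-tunes the generator for only a few update epochs on the single unseen image. The class and function names, the 4x4 conditioning map, the generator call signature, and the reconstruction-only loss are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioSPADE(nn.Module):
    """Normalize generator feature maps, then re-scale and shift them with
    per-pixel gamma/beta maps predicted from an audio embedding."""

    def __init__(self, feat_channels: int, audio_dim: int, hidden: int = 128):
        super().__init__()
        self.hidden = hidden
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        # Lift the audio embedding to a coarse 4x4 conditioning map; it is
        # upsampled to the feature resolution and convolved into modulation maps.
        self.to_map = nn.Linear(audio_dim, hidden * 4 * 4)
        self.shared = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_gamma = nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # feat: [B, C, H, W]; audio_emb: [B, audio_dim]
        b, _, h, w = feat.shape
        cond = self.to_map(audio_emb).view(b, self.hidden, 4, 4)
        cond = F.interpolate(cond, size=(h, w), mode="nearest")
        cond = self.shared(cond)
        gamma = self.to_gamma(cond)
        beta = self.to_beta(cond)
        return self.norm(feat) * (1 + gamma) + beta


def adapt_one_shot(generator, identity_image, audio_emb, epochs: int = 5, lr: float = 1e-4):
    """Hypothetical few-shot adaptation loop: fine-tune the generator for a
    handful of update epochs on the single unseen image; an L1 reconstruction
    loss stands in for the paper's full objective."""
    optimiser = torch.optim.Adam(generator.parameters(), lr=lr)
    for _ in range(epochs):
        optimiser.zero_grad()
        frame = generator(identity_image, audio_emb)  # assumed call signature
        loss = F.l1_loss(frame, identity_image)
        loss.backward()
        optimiser.step()
    return generator
```

A full training setup would pair such a generator with the multiple multi-level discriminators mentioned in the abstract; the L1 loss above is a deliberately simplified stand-in.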
Related papers
- Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions.
We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms.
Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z)
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis [66.43223397997559]
We aim to synthesize high-quality talking portrait videos corresponding to the input text.
This task has broad application prospects in the digital human industry but has not been technically achieved yet.
We introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which designs a generic zero-shot multi-speaker Text-to-Speech model.
arXiv Detail & Related papers (2023-06-06T08:50:13Z)
- Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild [44.92322575562816]
We propose a VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations.
Our generator learns to synthesize speech in any voice for the lip sequences of any person.
We conduct numerous ablation studies to analyze the effect of different modules of our architecture.
arXiv Detail & Related papers (2022-09-01T17:50:29Z)
- Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z)
- One Shot Audio to Animated Video Generation [15.148595295859659]
We propose a novel method to generate an animated video of arbitrary length using an audio clip and a single unseen image of a person as an input.
OneShotAu2AV can generate animated videos that have: (a) lip movements that are in sync with the audio, (b) natural facial expressions such as blinks and eyebrow movements, (c) head movements.
arXiv Detail & Related papers (2021-02-19T04:29:17Z)
- Multi Modal Adaptive Normalization for Audio to Video Generation [18.812696623555855]
We propose a multi-modal adaptive normalization (MAN) based architecture to synthesize a talking person video of arbitrary length using as input: an audio signal and a single image of a person.
The architecture uses multi-modal adaptive normalization, a keypoint heatmap predictor, an optical flow predictor, and class activation map [58] based layers to learn movements of expressive facial components.
arXiv Detail & Related papers (2020-12-14T07:39:45Z)
- Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.