Audio-Driven Dubbing for User Generated Contents via Style-Aware
Semi-Parametric Synthesis
- URL: http://arxiv.org/abs/2309.00030v1
- Date: Thu, 31 Aug 2023 15:41:40 GMT
- Title: Audio-Driven Dubbing for User Generated Contents via Style-Aware
Semi-Parametric Synthesis
- Authors: Linsen Song, Wayne Wu, Chaoyou Fu, Chen Change Loy, Ran He
- Abstract summary: Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production.
In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production.
- Score: 123.11530365315677
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing automated dubbing methods are usually designed for Professionally
Generated Content (PGC) production, which requires massive training data and
training time to learn a person-specific audio-video mapping. In this paper, we
investigate an audio-driven dubbing method that is more feasible for User
Generated Content (UGC) production. There are two unique challenges in designing a
method for UGC: 1) the appearances of speakers are diverse and arbitrary, as the
method must generalize across users; 2) the available video data for any one
speaker are very limited. To tackle these challenges, we first
introduce a new Style Translation Network to integrate the speaking style of
the target and the speaking content of the source via a cross-modal AdaIN
module. It enables our model to quickly adapt to a new speaker. Then, we
further develop a semi-parametric video renderer, which takes full advantage of
the limited training data of the unseen speaker via a video-level
retrieve-warp-refine pipeline. Finally, we propose a temporal regularization
for the semi-parametric renderer, which yields more temporally continuous videos.
Extensive experiments show that our method generates videos that accurately
preserve various speaking styles, yet with a considerably smaller amount of
training data and training time than existing methods. In addition, our method
achieves faster test-time speed than most recent methods.
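The cross-modal AdaIN idea described in the abstract can be illustrated with a short sketch: content features derived from the source audio are instance-normalized and then modulated with scale and shift parameters predicted from a style code of the target speaker. The PyTorch snippet below is a minimal sketch under assumed shapes and layer names (`CrossModalAdaIN`, `to_gamma`, `to_beta` are illustrative, not the paper's released implementation).

```python
import torch
import torch.nn as nn


class CrossModalAdaIN(nn.Module):
    """Minimal sketch of a cross-modal AdaIN layer: audio-derived content
    features are instance-normalized, then re-scaled and shifted with affine
    parameters predicted from a style code of the target speaker.
    Shapes and layer names are illustrative assumptions."""

    def __init__(self, content_dim: int, style_dim: int):
        super().__init__()
        # Predict per-channel scale (gamma) and shift (beta) from the style code.
        self.to_gamma = nn.Linear(style_dim, content_dim)
        self.to_beta = nn.Linear(style_dim, content_dim)

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # content: (B, T, C) sequence of content features from the source audio
        # style:   (B, S)    style code summarizing the target speaker
        mean = content.mean(dim=1, keepdim=True)
        std = content.std(dim=1, keepdim=True) + 1e-6
        normalized = (content - mean) / std
        gamma = self.to_gamma(style).unsqueeze(1)  # (B, 1, C)
        beta = self.to_beta(style).unsqueeze(1)    # (B, 1, C)
        return gamma * normalized + beta


# Usage: adapt source-speech content features to a new speaker's style code.
adain = CrossModalAdaIN(content_dim=256, style_dim=128)
content_feats = torch.randn(2, 50, 256)  # 50 audio frames per clip
style_code = torch.randn(2, 128)         # extracted from a short target video
print(adain(content_feats, style_code).shape)  # torch.Size([2, 50, 256])
```

Because the style statistics are injected through normalization rather than learned per speaker, only a short clip of the target is needed to compute a style code, which is consistent with the fast-adaptation claim.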
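The video-level retrieve-warp-refine idea can likewise be sketched. The function below is a toy illustration under assumed interfaces (`warp_net`, `refine_net`, and the frame/code shapes are hypothetical stand-ins): it retrieves the reference frame closest to an audio-driven target code, warps it with a predicted sampling grid, and lets a refinement network clean up artifacts.

```python
import torch
import torch.nn.functional as F


def retrieve_warp_refine(driving_code, frame_bank, frame_codes, warp_net, refine_net):
    """Toy sketch of one video-level retrieve-warp-refine step.
    driving_code: (D,)         audio-driven target code for the current frame
    frame_bank:   (N, 3, H, W) the unseen speaker's few reference frames
    frame_codes:  (N, D)       codes describing each reference frame
    warp_net, refine_net:      assumed callables, not the paper's modules
    """
    # 1. Retrieve: pick the reference frame whose code is closest to the target.
    dists = torch.cdist(driving_code.unsqueeze(0), frame_codes)  # (1, N)
    reference = frame_bank[dists.argmin(dim=1)]                  # (1, 3, H, W)

    # 2. Warp: resample the retrieved frame with a predicted sampling grid.
    grid = warp_net(reference, driving_code)                     # (1, H, W, 2) in [-1, 1]
    warped = F.grid_sample(reference, grid, align_corners=True)

    # 3. Refine: remove warping artifacts given the warped and reference frames.
    return refine_net(torch.cat([warped, reference], dim=1))     # (1, 3, H, W)


# Runnable toy example with an identity-warp and pass-through refinement.
H, W, N, D = 64, 64, 8, 128
frame_bank, frame_codes = torch.rand(N, 3, H, W), torch.rand(N, D)

def warp_net(ref, code):
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    return torch.stack([xs, ys], dim=-1).unsqueeze(0)  # identity sampling grid

refine_net = lambda x: x[:, :3]  # placeholder refinement network
print(retrieve_warp_refine(torch.rand(D), frame_bank, frame_codes,
                           warp_net, refine_net).shape)  # torch.Size([1, 3, 64, 64])
```

The retrieval step is what makes the renderer semi-parametric: real pixels from the speaker's few available frames carry most of the appearance, so the learned warp and refinement networks have far less to model from scratch.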
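Finally, the temporal regularization can be understood as a smoothness penalty over consecutive generated frames. The loss below is a generic stand-in (an assumption, not the paper's exact formulation): it encourages the generated frame-to-frame differences to match those of the ground-truth clip, which suppresses flicker in the output video.

```python
import torch


def temporal_regularization(generated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Generic temporal regularizer (a stand-in for the paper's exact term).
    Matches the generated video's frame-to-frame changes to those of the
    ground-truth clip, penalizing temporal flicker between consecutive frames.
    generated, target: (B, T, 3, H, W)
    """
    gen_diff = generated[:, 1:] - generated[:, :-1]  # temporal differences
    tgt_diff = target[:, 1:] - target[:, :-1]
    return (gen_diff - tgt_diff).abs().mean()        # L1 on the difference maps


# Example: combine with a per-frame reconstruction loss during training.
gen = torch.rand(2, 16, 3, 64, 64)
gt = torch.rand(2, 16, 3, 64, 64)
loss = (gen - gt).abs().mean() + 0.1 * temporal_regularization(gen, gt)
print(loss.item())
```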
Related papers
- Read, Watch and Scream! Sound Generation from Text and Video [23.990569918960315]
We propose a novel video-and-text-to-sound generation method called ReWaS.
Our method estimates the structural information of audio from the video while receiving key content cues from a user prompt.
By separating the generative components of audio, it becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences.
arXiv Detail & Related papers (2024-07-08T01:59:17Z)
- Unified Video-Language Pre-training with Synchronized Audio [21.607860535968356]
We propose an enhanced framework for Video-Language pre-training with Synchronized Audio.
Our framework learns tri-modal representations in a unified self-supervised transformer.
Our model, pre-trained on only 0.9M data, achieves improved results against state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-12T07:59:46Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the transfer of these techniques from academia to industry.
In this work, we aim to fill the gap with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets, demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize the main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)