Learning to Dub Movies via Hierarchical Prosody Models
- URL: http://arxiv.org/abs/2212.04054v2
- Date: Tue, 4 Apr 2023 11:33:30 GMT
- Title: Learning to Dub Movies via Hierarchical Prosody Models
- Authors: Gaoxiang Cong, Liang Li, Yuankai Qi, Zhengjun Zha, Qi Wu, Wenyu Wang,
Bin Jiang, Ming-Hsuan Yang, Qingming Huang
- Abstract summary: Given a piece of text, a video clip, and a reference audio clip, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference.
We propose a novel movie dubbing architecture that tackles these problems via hierarchical prosody modelling, bridging visual information to the corresponding speech prosody from three aspects: lip, face, and scene.
- Score: 167.6465354313349
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a piece of text, a video clip, and a reference audio clip, the movie
dubbing task (also known as visual voice cloning, V2C) aims to generate speech that
matches the speaker's emotion presented in the video, using the desired speaker's
voice as reference. V2C is more challenging than conventional text-to-speech
tasks as it additionally requires the generated speech to exactly match the
varying emotions and speaking speed presented in the video. Unlike previous
works, we propose a novel movie dubbing architecture that tackles these problems
via hierarchical prosody modelling, which bridges visual information to the
corresponding speech prosody from three aspects: lip, face, and scene.
Specifically, we align lip movement with speech duration, and convey facial
expression to speech energy and pitch via an attention mechanism based on valence
and arousal representations, inspired by recent psychology findings. Moreover,
we design an emotion booster to capture the atmosphere from global video
scenes. All these embeddings are then used together to generate a mel-spectrogram,
which is converted to a speech waveform via an existing vocoder. Extensive
experimental results on the Chem and V2C benchmark datasets demonstrate the
favorable performance of the proposed method. The source code and trained models
will be released to the public.
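As a rough sketch of the pipeline described in the abstract: lip features drive a duration predictor that expands phoneme states to frame level, facial (valence/arousal) features drive pitch and energy through attention, and a global scene embedding acts as the emotion booster before mel-spectrogram decoding. The PyTorch module below is a minimal illustration under these assumptions; its layer choices, dimensions, and names are hypothetical and not the authors' released implementation.

```python
# A minimal sketch, assuming PyTorch; module names, dimensions, and the simple
# attention/fusion choices below are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class HierarchicalProsodyDubber(nn.Module):
    def __init__(self, d_model: int = 256, n_mels: int = 80):
        super().__init__()
        self.text_encoder = nn.GRU(d_model, d_model, batch_first=True)
        # Lip branch: predict a per-phoneme duration from lip-motion features.
        self.duration_predictor = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )
        # Face branch: attend over facial (valence/arousal-style) features to
        # drive frame-level pitch and energy.
        self.face_attention = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.pitch_head = nn.Linear(d_model, 1)
        self.energy_head = nn.Linear(d_model, 1)
        # Scene branch ("emotion booster"): one global scene embedding per clip.
        self.scene_proj = nn.Linear(d_model, d_model)
        # Decode fused frames to a mel-spectrogram (a vocoder turns this into audio).
        self.mel_decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, phonemes, lip_feats, face_feats, scene_feat):
        # phonemes:   (B, T_text, d)  phoneme/text embeddings
        # lip_feats:  (B, T_text, d)  lip features aligned to phonemes
        # face_feats: (B, T_face, d)  facial expression features
        # scene_feat: (B, d)          global scene embedding
        text_h, _ = self.text_encoder(phonemes)

        # Lip -> duration: expand each phoneme state by its predicted length.
        log_dur = self.duration_predictor(torch.cat([text_h, lip_feats], dim=-1))
        dur = log_dur.squeeze(-1).exp().round().clamp(min=1).long()
        frames = nn.utils.rnn.pad_sequence(
            [h.repeat_interleave(n, dim=0) for h, n in zip(text_h, dur)],
            batch_first=True,
        )  # (B, T_mel, d)

        # Face -> pitch/energy via attention from speech frames to facial features.
        face_ctx, _ = self.face_attention(frames, face_feats, face_feats)
        pitch, energy = self.pitch_head(face_ctx), self.energy_head(face_ctx)

        # Scene -> global emotion "boost" added to every frame before decoding.
        fused = frames + face_ctx + self.scene_proj(scene_feat).unsqueeze(1)
        mel_h, _ = self.mel_decoder(fused)
        return self.mel_head(mel_h), pitch, energy


if __name__ == "__main__":
    B, T_text, T_face, d = 2, 12, 40, 256
    model = HierarchicalProsodyDubber()
    mel, pitch, energy = model(
        torch.randn(B, T_text, d), torch.randn(B, T_text, d),
        torch.randn(B, T_face, d), torch.randn(B, d),
    )
    print(mel.shape, pitch.shape, energy.shape)
```

In the full system, the predicted mel-spectrogram would be passed to an existing neural vocoder to obtain the final speech waveform, as the abstract describes.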
Related papers
- StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing [125.86266166482704]
We propose StyleDubber, which switches dubbing learning from the frame level to the phoneme level.
It contains three main components: (1) a multimodal style adaptor operating at the phoneme level to learn pronunciation style from the reference audio and generate intermediate representations informed by the facial emotion presented in the video; (2) an utterance-level style learning module, which guides both the mel-spectrogram decoding and the refining processes from the intermediate embeddings to improve the overall style expression; and (3) a phoneme-guided lip aligner to maintain lip sync.
arXiv Detail & Related papers (2024-02-20T01:28:34Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z)
- VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection [32.65865343643458]
Recent studies have shown impressive performance in synthesizing speech from silent talking-face videos.
We introduce a speech-visage selection module that separates the speech content and the speaker identity from the visual features of the input video.
The proposed framework brings the advantage of synthesizing speech with the correct content even when given a silent talking-face video of an unseen subject.
arXiv Detail & Related papers (2022-06-15T11:29:58Z)
- V2C: Visual Voice Cloning [55.55301826567474]
We propose a new task named Visual Voice Cloning (V2C).
V2C seeks to convert a paragraph of text to speech with both the desired voice, specified by a reference audio, and the desired emotion, specified by a reference video.
Our dataset contains 10,217 animated movie clips covering a large variety of genres.
arXiv Detail & Related papers (2021-11-25T03:35:18Z)
- Speech2Video: Cross-Modal Distillation for Speech to Video Generation [21.757776580641902]
Speech-to-video generation techniques can spark interesting applications in the entertainment, customer service, and human-computer interaction industries.
The challenge mainly lies in disentangling the distinct visual attributes from audio signals.
We propose a light-weight, cross-modal distillation method to extract disentangled emotional and identity information from unlabelled video inputs.
arXiv Detail & Related papers (2021-07-10T10:27:26Z)
- Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation [28.157431757281692]
We propose a text-based talking-head video generation framework that synthesizes high-fidelity facial expressions and head motions.
Our framework consists of a speaker-independent stage and a speaker-specific stage.
Our algorithm produces high-quality, photo-realistic talking-head videos with various facial expressions and head motions that follow the speech rhythm.
arXiv Detail & Related papers (2021-04-16T09:44:12Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)