StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing
- URL: http://arxiv.org/abs/2402.12636v3
- Date: Tue, 2 Jul 2024 03:42:36 GMT
- Title: StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing
- Authors: Gaoxiang Cong, Yuankai Qi, Liang Li, Amin Beheshti, Zhedong Zhang, Anton van den Hengel, Ming-Hsuan Yang, Chenggang Yan, Qingming Huang
- Abstract summary: We propose StyleDubber, which switches dubbing learning from the frame level to the phoneme level.
It contains three main components: (1) A multimodal style adaptor operating at the phoneme level to learn pronunciation style from the reference audio, and generate intermediate representations informed by the facial emotion presented in the video; (2) An utterance-level style learning module, which guides both the mel-spectrogram decoding and the refining processes from the intermediate embeddings to improve the overall style expression; and (3) a phoneme-guided lip aligner to maintain lip sync.
- Score: 125.86266166482704
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a script, the challenge in Movie Dubbing (Visual Voice Cloning, V2C) is to generate speech that aligns well with the video in both time and emotion, based on the tone of a reference audio track. Existing state-of-the-art V2C models break the phonemes in the script according to the divisions between video frames, which solves the temporal alignment problem but leads to incomplete phoneme pronunciation and poor identity stability. To address this problem, we propose StyleDubber, which switches dubbing learning from the frame level to the phoneme level. It contains three main components: (1) A multimodal style adaptor operating at the phoneme level to learn pronunciation style from the reference audio, and generate intermediate representations informed by the facial emotion presented in the video; (2) An utterance-level style learning module, which guides both the mel-spectrogram decoding and the refining processes from the intermediate embeddings to improve the overall style expression; and (3) a phoneme-guided lip aligner to maintain lip sync. Extensive experiments on two of the primary benchmarks, V2C and Grid, demonstrate the favorable performance of the proposed method as compared to the current state-of-the-art. The code will be made available at https://github.com/GalaxyCong/StyleDubber.
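To make the three components concrete, here is a minimal PyTorch-style sketch of how a phoneme-level style adaptor, an utterance-level style module, and a phoneme-guided lip aligner could be wired together. All module names, dimensions, and the attention-based wiring are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Hypothetical sketch of the three StyleDubber components named in the abstract.
# Module names, dimensions, and wiring are assumptions for illustration only.
import torch
import torch.nn as nn


class PhonemeStyleAdaptor(nn.Module):
    """Attends from phoneme embeddings to reference-audio and facial-emotion
    features to form phoneme-level intermediate representations."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.face_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, phonemes, ref_audio, face_emotion):
        style, _ = self.audio_attn(phonemes, ref_audio, ref_audio)        # pronunciation style
        emotion, _ = self.face_attn(phonemes, face_emotion, face_emotion)  # visual emotion
        return phonemes + style + emotion


class UtteranceStyleModule(nn.Module):
    """Pools the reference audio into one utterance-level style vector that
    conditions mel-spectrogram decoding."""
    def __init__(self, dim=256, n_mels=80):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, hidden, ref_audio):
        style = self.pool(ref_audio.transpose(1, 2)).transpose(1, 2)  # (B, 1, D)
        out, _ = self.decoder(hidden + style)                         # style-conditioned decoding
        return self.to_mel(out)


class PhonemeLipAligner(nn.Module):
    """Soft-aligns phoneme features to lip-motion frames to keep lip sync."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, phoneme_feats, lip_feats):
        aligned, weights = self.attn(lip_feats, phoneme_feats, phoneme_feats)
        return aligned, weights  # weights are a soft phoneme-to-frame alignment


if __name__ == "__main__":
    B, P, T_a, T_v, D = 2, 20, 120, 60, 256
    phonemes = torch.randn(B, P, D)     # embedded script phonemes
    ref_audio = torch.randn(B, T_a, D)  # reference-audio frames
    face = torch.randn(B, T_v, D)       # facial-emotion frames
    lips = torch.randn(B, T_v, D)       # lip-motion frames

    hidden = PhonemeStyleAdaptor()(phonemes, ref_audio, face)
    aligned, _ = PhonemeLipAligner()(hidden, lips)
    mel = UtteranceStyleModule()(aligned, ref_audio)
    print(mel.shape)  # (B, T_v, 80): one mel frame per video frame
```

In this sketch, the aligner's attention weights give a soft phoneme-to-video-frame mapping, which is what lets style be learned per phoneme while the output mel-spectrogram stays on the video's time axis.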
Related papers
- Unified Video-Language Pre-training with Synchronized Audio [21.607860535968356]
We propose an enhanced framework for Video-Language pre-training with Synchronized Audio.
Our framework learns tri-modal representations in a unified self-supervised transformer.
Our model, pre-trained on only 0.9M data, achieves improved results over state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-12T07:59:46Z) - Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis [123.11530365315677]
Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production.
In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production.
arXiv Detail & Related papers (2023-08-31T15:41:40Z) - InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z) - Learning to Dub Movies via Hierarchical Prosody Models [167.6465354313349]
Given a piece of text, a video clip and a reference audio, the movie dubbing (also known as Visual Voice Cloning, V2C) task aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference.
We propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modelling, which bridges the visual information to corresponding speech prosody from three aspects: lip, face, and scene.
arXiv Detail & Related papers (2022-12-08T03:29:04Z) - Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z) - VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild [37.93856291026653]
VideoReTalking is a new system to edit the faces of a real-world talking head video according to input audio.
It produces a high-quality, lip-synced output video even with a different emotion.
arXiv Detail & Related papers (2022-11-27T08:14:23Z) - V2C: Visual Voice Cloning [55.55301826567474]
We propose a new task named Visual Voice Cloning (V2C).
V2C seeks to convert a paragraph of text to speech with both the desired voice specified by a reference audio and the desired emotion specified by a reference video.
Our dataset contains 10,217 animated movie clips covering a large variety of genres.
arXiv Detail & Related papers (2021-11-25T03:35:18Z) - Neural Dubber: Dubbing for Silent Videos According to Scripts [22.814626504851752]
We propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task.
Neural Dubber is a multi-modal text-to-speech model that utilizes the lip movement in the video to control the prosody of the generated speech.
Experiments show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.
arXiv Detail & Related papers (2021-10-15T17:56:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.