Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning
- URL: http://arxiv.org/abs/2511.14249v1
- Date: Tue, 18 Nov 2025 08:39:44 GMT
- Title: Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning
- Authors: Rui Liu, Yuan Zhao, Zhenqi Jia
- Abstract summary: We propose a new Retrieve-Augmented Director-Actor Interaction Learning scheme to achieve authentic movie dubbing. We construct a multimodal Reference Footage library to simulate the learning footage provided by directors. An Emotion-Similarity-based Retrieval-Augmentation strategy retrieves the most relevant multimodal information that aligns with the target silent video.
- Score: 11.98494175240752
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The automatic movie dubbing model generates vivid speech from given scripts, replicating a speaker's timbre from a brief timbre prompt while ensuring lip-sync with the silent video. Existing approaches simulate a simplified workflow where actors dub directly without preparation, overlooking the critical director-actor interaction. In contrast, authentic workflows involve a dynamic collaboration: directors actively engage with actors, guiding them to internalize the context cues, specifically emotion, before performance. To address this issue, we propose a new Retrieve-Augmented Director-Actor Interaction Learning scheme to achieve authentic movie dubbing, termed Authentic-Dubber, which contains three novel mechanisms: (1) We construct a multimodal Reference Footage library to simulate the learning footage provided by directors. Note that we integrate Large Language Models (LLMs) to achieve deep comprehension of emotional representations across multimodal signals. (2) To emulate how actors efficiently and comprehensively internalize director-provided footage during dubbing, we propose an Emotion-Similarity-based Retrieval-Augmentation strategy. This strategy retrieves the most relevant multimodal information that aligns with the target silent video. (3) We develop a Progressive Graph-based speech generation approach that incrementally incorporates the retrieved multimodal emotional knowledge, thereby simulating the actor's final dubbing process. The above mechanisms enable the Authentic-Dubber to faithfully replicate the authentic dubbing workflow, achieving comprehensive improvements in emotional expressiveness. Both subjective and objective evaluations on the V2C Animation benchmark dataset validate the effectiveness. The code and demos are available at https://github.com/AI-S2-Lab/Authentic-Dubber.
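Of the three mechanisms above, the Emotion-Similarity-based Retrieval-Augmentation step (2) lends itself to a compact illustration. The sketch below is a minimal, assumption-laden rendering, not the paper's implementation: it assumes each reference-footage entry and the target silent video are summarized as fixed-size emotion embeddings, and that relevance is measured by cosine similarity.

```python
# Minimal sketch of the emotion-similarity retrieval idea. Everything here is
# an illustrative assumption rather than the paper's interface: reference
# footage entries and the target silent video are each represented by a
# fixed-size emotion embedding, and relevance is plain cosine similarity.
import numpy as np

def cosine_similarities(query: np.ndarray, bank: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a bank of row vectors."""
    q = query / (np.linalg.norm(query) + 1e-8)
    b = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-8)
    return b @ q

def retrieve_top_k(video_emb: np.ndarray, library_embs: np.ndarray,
                   k: int = 3) -> list[int]:
    """Indices of the k reference-footage entries whose emotion embeddings
    best match the target silent video's embedding."""
    sims = cosine_similarities(video_emb, library_embs)
    return np.argsort(-sims)[:k].tolist()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    library = rng.standard_normal((100, 256))  # 100 entries, 256-dim embeddings
    target = rng.standard_normal(256)          # embedding of the silent video
    print(retrieve_top_k(target, library))     # indices of the 3 closest entries
```

The progressive graph-based generation step (3) would then fold the retrieved entries into speech synthesis incrementally; that stage depends on model details the abstract does not specify, so it is not sketched here.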
Related papers
- FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes [56.534404169212785]
FunCineForge comprises an end-to-end production pipeline for large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes. We construct the first Chinese television dubbing dataset with rich annotations, and demonstrate the high quality of these data. Experiments across monologue, narration, dialogue, and multi-speaker scenes show that our dubbing model consistently outperforms SOTA methods in audio quality, lip sync, timbre transfer, and instruction following.
arXiv Detail & Related papers (2026-01-21T08:57:00Z)
- Bridging Your Imagination with Audio-Video Generation via a Unified Director [54.45375287950375]
We argue that logical reasoning and imaginative thinking are both fundamental qualities of a film director. We propose UniMAGE, a unified director model that bridges user prompts with well-structured scripts.
arXiv Detail & Related papers (2025-12-29T05:56:22Z)
- MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing [12.954750400557344]
We introduce a multi-modal generative framework for movie dubbing. It produces high-quality dubbing using large speech generation models, guided by multi-modal inputs. Results show superior performance compared to state-of-the-art (SOTA) methods.
arXiv Detail & Related papers (2025-05-22T06:23:05Z)
- FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing [81.3306413498174]
Movie dubbing aims to convert scripts into speech that aligns with the given movie clip in both temporal and emotional aspects. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. We propose a large language model (LLM) based flow matching architecture for dubbing, named FlowDubber.
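FlowDubber's name points at flow matching, a general generative technique in which a network learns the velocity field of a path from noise to data. The sketch below is a textbook-style flow-matching training step under simplifying assumptions (a tiny unconditional MLP over mel-like vectors); it is not FlowDubber's actual architecture.

```python
# Minimal sketch of a generic flow-matching training step (the technique
# FlowDubber's name refers to), not the paper's actual model. The tiny MLP,
# tensor shapes, and lack of conditioning are simplifying assumptions.
import torch
import torch.nn as nn

dim = 80  # e.g. mel-spectrogram channels
velocity_net = nn.Sequential(
    nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim)
)
opt = torch.optim.Adam(velocity_net.parameters(), lr=1e-4)

def flow_matching_step(x1: torch.Tensor) -> float:
    """One training step: regress the velocity of a straight path from
    noise x0 to data x1 at a random time t."""
    x0 = torch.randn_like(x1)           # noise sample
    t = torch.rand(x1.shape[0], 1)      # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1          # point on the straight path
    target_v = x1 - x0                  # its constant velocity
    pred_v = velocity_net(torch.cat([xt, t], dim=-1))
    loss = ((pred_v - target_v) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

batch = torch.randn(16, dim)            # stand-in for real mel frames
print(flow_matching_step(batch))
```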
arXiv Detail & Related papers (2025-05-02T13:30:19Z)
- VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models [43.1613638989795]
We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals.
arXiv Detail & Related papers (2025-04-03T08:24:47Z)
- Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing [60.38045088180188]
We propose an acoustic-prosody disentangled two-stage method to achieve high-quality dubbing generation with precise prosody alignment. We incorporate an in-domain emotion analysis module to reduce the impact of visual domain shifts across different movies. Our method performs favorably against the state-of-the-art models on two primary benchmarks.
arXiv Detail & Related papers (2025-03-15T08:25:57Z)
- StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing [125.86266166482704]
We propose StyleDubber, which switches dubbing learning from the frame level to the phoneme level.
It contains three main components: (1) A multimodal style adaptor operating at the phoneme level to learn pronunciation style from the reference audio and generate intermediate representations informed by the facial emotion presented in the video; (2) An utterance-level style learning module, which guides both the mel-spectrogram decoding and the refining processes from the intermediate embeddings to improve the overall style expression; and (3) A phoneme-guided lip aligner to maintain lip sync.
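Component (1) reads naturally as cross-attention from phoneme-level queries to frame-level reference-audio features. The sketch below renders that generic reading under assumed dimensions and module choices; it is not StyleDubber's actual adaptor.

```python
# Illustrative sketch of phoneme-level style adaptation as cross-attention:
# each phoneme embedding queries the reference audio's frame-level features.
# This is a generic reading of component (1), not StyleDubber's actual module;
# all dimensions and module choices are assumptions.
import torch
import torch.nn as nn

class PhonemeStyleAdaptor(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)

    def forward(self, phonemes: torch.Tensor,
                ref_audio: torch.Tensor) -> torch.Tensor:
        """phonemes: (B, n_phonemes, d), ref_audio: (B, n_frames, d).
        Returns phoneme features enriched with reference pronunciation style."""
        style, _ = self.cross_attn(query=phonemes, key=ref_audio,
                                   value=ref_audio)
        return phonemes + style  # residual: keep phonetic content, add style

adaptor = PhonemeStyleAdaptor()
phon = torch.randn(2, 30, 256)   # 30 phonemes per utterance
ref = torch.randn(2, 120, 256)   # 120 reference-audio frames
print(adaptor(phon, ref).shape)  # torch.Size([2, 30, 256])
```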
arXiv Detail & Related papers (2024-02-20T01:28:34Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)