FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes
- URL: http://arxiv.org/abs/2601.14777v1
- Date: Wed, 21 Jan 2026 08:57:00 GMT
- Authors: Jiaxuan Liu, Yang Xiang, Han Zhao, Xiangang Li, Zhenhua Ling
- Abstract summary: FunCineForge comprises an end-to-end production pipeline for large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes. We construct the first Chinese television dubbing dataset with rich annotations, and demonstrate the high quality of these data. Experiments across monologue, narration, dialogue, and multi-speaker scenes show that our dubbing model consistently outperforms SOTA methods in audio quality, lip sync, timbre transfer, and instruction following.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Movie dubbing is the task of synthesizing speech from scripts conditioned on video scenes, requiring accurate lip sync, faithful timbre transfer, and proper modeling of character identity and emotion. However, existing methods face two major limitations: (1) high-quality multimodal dubbing datasets are limited in scale, suffer from high word error rates, contain sparse annotations, rely on costly manual labeling, and are restricted to monologue scenes, all of which hinder effective model training; (2) existing dubbing models rely solely on the lip region to learn audio-visual alignment, which limits their applicability to complex live-action cinematic scenes, and exhibit suboptimal performance in lip sync, speech quality, and emotional expressiveness. To address these issues, we propose FunCineForge, which comprises an end-to-end production pipeline for large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes. Using the pipeline, we construct the first Chinese television dubbing dataset with rich annotations, and demonstrate the high quality of these data. Experiments across monologue, narration, dialogue, and multi-speaker scenes show that our dubbing model consistently outperforms SOTA methods in audio quality, lip sync, timbre transfer, and instruction following. Code and demos are available at https://anonymous.4open.science/w/FunCineForge.
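One concrete pain point the abstract raises is the high word error rate of existing corpora, which a dataset pipeline typically addresses by screening candidate clips against an ASR transcript of the audio. Below is a minimal sketch of such a WER-based filtering pass; the pure-Python `wer`, the field names, and the 0.1 threshold are illustrative assumptions, not details taken from FunCineForge.

```python
# Minimal sketch of a WER-based filtering pass for dubbing-clip candidates.
# Assumption: each clip pairs a script line with an ASR transcript; clips
# whose word error rate exceeds a threshold are dropped. Threshold and
# field names are illustrative, not taken from the FunCineForge paper.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def filter_clips(clips: list[dict], max_wer: float = 0.1) -> list[dict]:
    """Keep only clips whose ASR transcript closely matches the script."""
    return [c for c in clips if wer(c["script"], c["asr_transcript"]) <= max_wer]

clips = [
    {"script": "we meet at dawn", "asr_transcript": "we meet at dawn"},
    {"script": "we meet at dawn", "asr_transcript": "we eat at down town"},
]
print(len(filter_clips(clips)))  # 1: the mismatched clip is dropped
```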
Related papers
- JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion (arXiv 2026-01-29)
We introduce a single-model approach that adapts an audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. We generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.
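The one architectural detail this summary gives is the "lightweight LoRA" used to adapt the diffusion backbone. As a hedged illustration of that mechanism (not JUST-DUB-IT's actual code), here is a standard LoRA wrapper around a frozen linear layer; the rank, scaling, and the 512-dimensional stand-in projection are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update (W + BA).

    Illustrative sketch of the 'lightweight LoRA' idea; layer sizes,
    rank, and scaling are assumptions, not taken from the paper.
    """
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():    # freeze the pretrained weights
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B
        nn.init.zeros_(self.up.weight)      # start as an identity update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Wrap one projection of a (stand-in) diffusion backbone.
proj = nn.Linear(512, 512)
lora_proj = LoRALinear(proj, rank=8)
x = torch.randn(2, 77, 512)
print(lora_proj(x).shape)  # torch.Size([2, 77, 512])
```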
- SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model (arXiv 2025-11-23)
Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization. We propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model.
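One common way to "vision-augment" a pretrained TTS model is to inject per-frame video features into the decoder through cross-attention. The sketch below shows that generic pattern; SyncVoice's actual fusion point and dimensions are not specified in the summary, so everything here is an assumption.

```python
import torch
import torch.nn as nn

class VisionAugmentedBlock(nn.Module):
    """Sketch of injecting video features into a pretrained TTS decoder
    via cross-attention. Dimensions and placement are assumptions; the
    summary states only that a pretrained TTS model is vision-augmented,
    not how."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, speech_h: torch.Tensor, video_h: torch.Tensor) -> torch.Tensor:
        # speech_h: (B, T_speech, D) hidden states from the TTS decoder
        # video_h:  (B, T_video, D) per-frame visual features (e.g. lip/face crops)
        attended, _ = self.cross_attn(query=speech_h, key=video_h, value=video_h)
        return self.norm(speech_h + attended)   # residual fusion

block = VisionAugmentedBlock()
speech = torch.randn(2, 120, 256)   # 120 decoder steps
video = torch.randn(2, 75, 256)     # 75 video frames (~3 s at 25 fps)
print(block(speech, video).shape)   # torch.Size([2, 120, 256])
```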
- MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing (arXiv 2025-05-22)
We introduce a multi-modal generative framework for movie dubbing. It produces high-quality dubbing using large speech generation models, guided by multi-modal inputs. Results show superior performance compared to state-of-the-art (SOTA) methods.
- VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models (arXiv 2025-04-03)
We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals.
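A neural codec language model casts dubbing as next-token prediction over discretized audio, conditioned on a prefix of text and visual cues. The toy model below illustrates that framing; the [text; face; codec] prefix layout, vocabulary sizes, and backbone depth are assumptions rather than VoiceCraft-Dub's design.

```python
import torch
import torch.nn as nn

class CodecDubLM(nn.Module):
    """Toy sketch of a neural-codec LM conditioned on text and facial cues.
    The prefix layout [text; face; audio-codec] and all sizes are
    illustrative assumptions, not VoiceCraft-Dub's actual design."""
    def __init__(self, n_text: int = 1000, n_codec: int = 1024, d: int = 256):
        super().__init__()
        self.text_emb = nn.Embedding(n_text, d)
        self.codec_emb = nn.Embedding(n_codec, d)
        self.face_proj = nn.Linear(512, d)          # project face features to d
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, n_codec)           # next-codec-token logits

    def forward(self, text_ids, face_feats, codec_ids):
        # Assemble the prefix: script tokens, per-frame face cues, then the
        # codec tokens generated so far; predict logits at every position.
        seq = torch.cat([self.text_emb(text_ids),
                         self.face_proj(face_feats),
                         self.codec_emb(codec_ids)], dim=1)
        n = seq.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.backbone(seq, mask=causal)
        return self.head(h[:, -codec_ids.size(1):])  # logits over codec vocab

model = CodecDubLM()
logits = model(torch.randint(0, 1000, (2, 16)),    # 16 text tokens
               torch.randn(2, 25, 512),             # 25 face-cue frames
               torch.randint(0, 1024, (2, 50)))     # 50 codec tokens so far
print(logits.shape)  # torch.Size([2, 50, 1024])
```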
- DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance (arXiv 2025-03-31)
Key aspects, such as adapting to different dubbing styles and handling dialogue, narration, and monologue effectively, have not been well studied. We propose a multi-modal large language model framework to address this challenge. It generates high-quality dubbing through large speech generation models, guided by multimodal conditions.
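The chain-of-thought guidance presumably asks the model to reason about the scene before choosing a dubbing style. A hypothetical prompt template for that classification step is sketched below; the wording and label set are invented for illustration and are not from the paper.

```python
# Illustrative CoT prompt for the scene-adaptive step the DeepDubber-V1
# summary describes (classify the clip as dialogue, narration, or
# monologue before dubbing). The wording is invented, not the paper's.
COT_PROMPT = """You are given keyframes and the script line: "{script}"
Think step by step:
1. How many speakers are visible, and who is on screen?
2. Is the line addressed to another character, the audience, or no one?
3. Conclude: is this DIALOGUE, NARRATION, or MONOLOGUE?
Answer with the label only on the final line."""

print(COT_PROMPT.format(script="I never meant for any of this to happen."))
```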
- Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing (arXiv 2025-03-15)
We propose an acoustic-prosody disentangled two-stage method to achieve high-quality dubbing generation with precise prosody alignment. We incorporate an in-domain emotion analysis module to reduce the impact of visual domain shifts across different movies. Our method performs favorably against the state-of-the-art models on two primary benchmarks.
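One standard mechanism for disentangling acoustic identity from prosody is adversarial training with a gradient reversal layer. The summary does not say which mechanism this paper actually uses, so the GRL below is a stand-in illustrating the general technique.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Gradient reversal: identity on the forward pass, negated gradient
    on the backward pass. A generic disentangling tool (e.g., keeping
    prosody information out of an acoustic encoder); whether this paper
    uses it is an assumption."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# An adversarial prosody classifier on top of reversed acoustic features
# pushes the acoustic encoder to discard prosody cues.
feats = torch.randn(4, 128, requires_grad=True)
clf = torch.nn.Linear(128, 5)            # 5 hypothetical prosody classes
loss = clf(grad_reverse(feats)).sum()
loss.backward()
print(feats.grad.shape)  # torch.Size([4, 128]); gradient to feats is negated
```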
- StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing (arXiv 2024-02-20)
We propose StyleDubber, which switches dubbing learning from the frame level to the phoneme level. It contains three main components: (1) a multimodal style adaptor operating at the phoneme level, which learns pronunciation style from the reference audio and generates intermediate representations informed by the facial emotion presented in the video; (2) an utterance-level style learning module, which guides both the mel-spectrogram decoding and the refining processes from the intermediate embeddings to improve overall style expression; and (3) a phoneme-guided lip aligner to maintain lip sync (a duration-rescaling sketch of this constraint follows the entry).
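The lip-sync constraint behind a phoneme-guided aligner can be read as a duration problem: stretch or compress per-phoneme durations so the speech spans exactly the lip-active video segment. The proportional-rescaling sketch below is an illustrative assumption, not StyleDubber's actual aligner; the frame hop and fps values are likewise assumed.

```python
# Sketch of the lip-sync constraint behind a phoneme-guided aligner:
# rescale predicted phoneme durations so the synthesized speech spans
# exactly the lip-active video segment. The proportional-rescale rule,
# fps, and mel hop are illustrative assumptions.

def align_durations(phoneme_durs: list[float], video_frames: int,
                    fps: float = 25.0, hop_s: float = 0.0116) -> list[int]:
    """Map per-phoneme durations (seconds) to mel-frame counts whose
    total matches the video segment length."""
    target_s = video_frames / fps                 # visible lip activity
    scale = target_s / sum(phoneme_durs)          # stretch/compress speech
    frames = [round(d * scale / hop_s) for d in phoneme_durs]
    # Absorb rounding error in the final phoneme so totals match exactly.
    frames[-1] += round(target_s / hop_s) - sum(frames)
    return frames

durs = [0.08, 0.12, 0.05, 0.20, 0.10]   # predicted durations for 5 phonemes
print(align_durations(durs, video_frames=50))  # spans exactly 2 s of video
```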
- MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images (arXiv 2023-06-12)
We present MovieFactory, a framework to generate cinematic-picture (3072×1280), film-style (multi-scene), and multi-modality (sounding) movies.
Our approach empowers users to create captivating movies with smooth transitions using simple text inputs.
- Learning to Dub Movies via Hierarchical Prosody Models (arXiv 2022-12-08)
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference.
We propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modelling, which bridges the visual information to corresponding speech prosody from three aspects: lip, face, and scene.
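The three-level hierarchy (lip, face, scene) can be read as visual features at three temporal scopes feeding a prosody predictor: frame-aligned lip cues, an utterance-level face/emotion summary, and a single global scene vector. The fusion sketch below encodes that reading; the pooling choices and pitch/energy heads are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HierarchicalProsodyFusion(nn.Module):
    """Sketch of bridging lip, face, and scene features to prosody, in
    the spirit of hierarchical prosody modelling. Pooling choices and
    the pitch/energy heads are illustrative assumptions."""
    def __init__(self, d_lip=128, d_face=128, d_scene=128, d=128):
        super().__init__()
        self.lip = nn.Linear(d_lip, d)      # local: articulation/duration cues
        self.face = nn.Linear(d_face, d)    # mid-level: speaker emotion cues
        self.scene = nn.Linear(d_scene, d)  # global: scene atmosphere cues
        self.pitch_head = nn.Linear(d, 1)
        self.energy_head = nn.Linear(d, 1)

    def forward(self, lip_seq, face_seq, scene_vec):
        # lip_seq: (B, T, d_lip), face_seq: (B, T, d_face), scene_vec: (B, d_scene)
        h = (self.lip(lip_seq)
             + self.face(face_seq).mean(dim=1, keepdim=True)   # utterance pool
             + self.scene(scene_vec).unsqueeze(1))             # broadcast global
        return self.pitch_head(h).squeeze(-1), self.energy_head(h).squeeze(-1)

fusion = HierarchicalProsodyFusion()
pitch, energy = fusion(torch.randn(2, 75, 128), torch.randn(2, 75, 128),
                       torch.randn(2, 128))
print(pitch.shape, energy.shape)  # torch.Size([2, 75]) torch.Size([2, 75])
```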