MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing
- URL: http://arxiv.org/abs/2505.16279v1
- Date: Thu, 22 May 2025 06:23:05 GMT
- Title: MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing
- Authors: Junjie Zheng, Zihao Chen, Chaofan Ding, Yunming Liang, Yihan Fan, Huan Yang, Lei Xie, Xinhan Di
- Abstract summary: We introduce a multi-modal generative framework for movie dubbing. It produces high-quality dubbing using large speech generation models, guided by multi-modal inputs. Results show superior performance compared to state-of-the-art (SOTA) methods.
- Score: 12.954750400557344
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current movie dubbing technology can produce the desired speech using a reference voice and input video, maintaining perfect synchronization with the visuals while effectively conveying the intended emotions. However, crucial aspects of movie dubbing, including adaptation to various dubbing styles, effective handling of dialogue, narration, and monologues, as well as consideration of subtle details such as speaker age and gender, remain insufficiently explored. To tackle these challenges, we introduce a multi-modal generative framework. First, it utilizes a multi-modal large vision-language model (VLM) to analyze visual inputs, enabling the recognition of dubbing types and fine-grained attributes. Second, it produces high-quality dubbing using large speech generation models, guided by multi-modal inputs. Additionally, a movie dubbing dataset with annotations for dubbing types and subtle details is constructed to enhance movie understanding and improve dubbing quality for the proposed multi-modal framework. Experimental results across multiple benchmark datasets show superior performance compared to state-of-the-art (SOTA) methods. Specifically, LSE-D, SPK-SIM, EMO-SIM, and MCD improve by up to 1.09%, 8.80%, 19.08%, and 18.74%, respectively.
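To make the two-stage design described in the abstract concrete, here is a minimal, illustrative Python sketch: a VLM-style analysis step predicts the dubbing type and speaker attributes, and a speech-generation step is conditioned on them. All names (`DubbingSpec`, `analyze_video`, `generate_dub`), the attribute set, and the interfaces are hypothetical stand-ins inferred from the abstract, not the authors' released code; both model calls are stubbed.

```python
"""Minimal two-stage dubbing pipeline sketch (illustrative only)."""
from dataclasses import dataclass


@dataclass
class DubbingSpec:
    dubbing_type: str    # "dialogue", "narration", or "monologue"
    speaker_age: str     # e.g. "adult"
    speaker_gender: str  # e.g. "female"
    emotion: str         # e.g. "sad"


def analyze_video(video_path: str) -> DubbingSpec:
    """Stage 1 (placeholder): VLM-based analysis of the visual input."""
    # A real implementation would prompt a multi-modal VLM with sampled frames
    # and parse its prediction of dubbing type and fine-grained attributes.
    return DubbingSpec("dialogue", "adult", "female", "neutral")


def generate_dub(script: str, reference_voice: str, spec: DubbingSpec) -> bytes:
    """Stage 2 (placeholder): speech generation conditioned on Stage-1 output."""
    # A real implementation would call a large speech-generation model,
    # conditioning on the script, the reference voice, and `spec`.
    print(f"Synthesizing '{script}' as {spec.dubbing_type} "
          f"({spec.speaker_gender}, {spec.speaker_age}, {spec.emotion})")
    return b""  # stand-in for a waveform


def dub_clip(video_path: str, script: str, reference_voice: str) -> bytes:
    spec = analyze_video(video_path)
    return generate_dub(script, reference_voice, spec)


if __name__ == "__main__":
    dub_clip("clip.mp4", "I never meant to leave.", "ref_voice.wav")
```

In the actual system, the Stage-1 prediction would presumably be serialized into the prompt or conditioning signal of the speech model; the exact interface is not specified in the abstract.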
Related papers
- Towards Film-Making Production Dialogue, Narration, Monologue Adaptive Moving Dubbing Benchmarks [6.71206005420634]
Talking Adaptive Dubbing Benchmarks (TA-Dubbing) is designed to improve film production by adapting to dialogue, narration, monologue, and actors in movie dubbing. TA-Dubbing offers several key advantages: 1) Comprehensive Dimensions: TA-Dubbing covers a variety of dimensions of movie dubbing, incorporating metric evaluations for both movie understanding and speech generation.
arXiv Detail & Related papers (2025-04-30T02:36:18Z)
- VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models [43.1613638989795]
We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals.
arXiv Detail & Related papers (2025-04-03T08:24:47Z)
- DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance [4.452513686760606]
Key aspects, such as adapting to different dubbing styles and effectively handling dialogue, narration, and monologue, have not been well studied. We propose a multi-modal large language model framework to address this challenge. It generates high-quality dubbing through large speech generation models, guided by multimodal conditions.
arXiv Detail & Related papers (2025-03-31T01:51:09Z)
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions. We propose a systemic captioning framework, achieving various modality annotations with more than 27.1k hours of trailer videos. Our dataset potentially paves the path for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation [43.35578187209748]
Foley audio faces significant challenges in the AI-generated content (AIGC) landscape.
Current text-to-audio technology relies on detailed and acoustically relevant textual descriptions.
We introduce the Multi-modal Image and Narrative Text Dubbing dataset (MINT).
MINT is designed to enhance mainstream dubbing tasks such as literary story audiobook dubbing and image/silent video dubbing.
arXiv Detail & Related papers (2024-06-15T10:47:36Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing [125.86266166482704]
We propose StyleDubber, which switches dubbing learning from the frame level to the phoneme level.
It contains three main components: (1) a multimodal style adaptor operating at the phoneme level to learn pronunciation style from the reference audio and generate intermediate representations informed by the facial emotion presented in the video; (2) an utterance-level style learning module, which guides both the mel-spectrogram decoding and the refining processes from the intermediate embeddings to improve the overall style expression; and (3) a phoneme-guided lip aligner to maintain lip sync.
arXiv Detail & Related papers (2024-02-20T01:28:34Z)
- Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
- Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis [123.11530365315677]
Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production.
In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production.
arXiv Detail & Related papers (2023-08-31T15:41:40Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
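As a reference point for the MCD figure reported in the MM-MovieDubber abstract above, the following is a minimal sketch of how mel-cepstral distortion is conventionally computed between time-aligned reference and synthesized mel-cepstral sequences. The cepstral order, the exclusion of c0, and any prior DTW alignment are assumptions about the standard protocol, not details taken from the paper.

```python
import numpy as np


def mel_cepstral_distortion(ref_mcep: np.ndarray, syn_mcep: np.ndarray) -> float:
    """Conventional MCD in dB between aligned mel-cepstral sequences.

    ref_mcep, syn_mcep: arrays of shape (T, D) holding mel-cepstral
    coefficients (c1..cD, with c0/energy already excluded). The two
    sequences are assumed to be time-aligned (e.g. via DTW) beforehand.
    """
    assert ref_mcep.shape == syn_mcep.shape
    # Per-frame Euclidean distance over the cepstral dimensions.
    diff = ref_mcep - syn_mcep
    frame_dist = np.sqrt(np.sum(diff ** 2, axis=1))
    # Standard scaling constant: (10 / ln 10) * sqrt(2).
    const = (10.0 / np.log(10.0)) * np.sqrt(2.0)
    return float(const * np.mean(frame_dist))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(100, 13))
    syn = ref + 0.05 * rng.normal(size=(100, 13))
    print(f"MCD: {mel_cepstral_distortion(ref, syn):.3f} dB")
```

Lower MCD indicates a closer spectral match to the reference, so the reported 18.74% improvement presumably corresponds to a reduction in this distance.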