DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance
- URL: http://arxiv.org/abs/2503.23660v1
- Date: Mon, 31 Mar 2025 01:51:09 GMT
- Title: DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance
- Authors: Junjie Zheng, Zihao Chen, Chaofan Ding, Xinhan Di
- Abstract summary: Key aspects such as adapting to different dubbing styles and effectively handling dialogue, narration, and monologue have not been well studied.
We propose a multi-modal large language model framework to address this challenge.
It generates high-quality dubbing through large speech generation models, guided by multimodal conditions.
- Score: 4.452513686760606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current movie dubbing technology can generate the desired voice from a given speech prompt, ensuring good synchronization between speech and visuals while accurately conveying the intended emotions. However, key aspects of movie dubbing, such as adapting to different dubbing styles, effectively handling dialogue, narration, and monologue, and understanding subtle details like the age and gender of speakers, have not been well studied. To address this challenge, we propose a multi-modal large language model framework. First, it uses multimodal Chain-of-Thought (CoT) reasoning on visual inputs to understand dubbing styles and fine-grained attributes. Second, it generates high-quality dubbing through large speech generation models, guided by multimodal conditions. Additionally, we have developed a movie dubbing dataset with CoT annotations. The evaluation results demonstrate a performance improvement over state-of-the-art methods across multiple datasets. In particular, SPK-SIM and EMO-SIM increase from 82.48% to 89.74% and from 66.24% to 78.88% for dubbing setting 2.0 on the V2C Animation dataset, LSE-D and MCD-SL decrease from 14.79 to 14.63 and from 5.24 to 4.74 for dubbing setting 2.0 on the Grid dataset, and SPK-SIM increases from 64.03 to 83.42 while WER decreases from 52.69% to 23.20% for the initial reasoning setting on the proposed CoT-Movie-Dubbing dataset, in comparison with state-of-the-art models.
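The abstract describes a two-stage pipeline: a multimodal CoT reasoning step that infers the dubbing style (dialogue, narration, or monologue) and speaker attributes from the video, followed by a speech generation model conditioned on those attributes. The sketch below illustrates that flow only; all names, fields, and signatures (`DubbingCondition`, `reason_dubbing_conditions`, `synthesize_dubbing`) are hypothetical and not the authors' actual interface.

```python
# Illustrative sketch of the two-stage flow described in the abstract.
# All names, fields, and signatures are hypothetical placeholders.
from dataclasses import dataclass
from typing import List

@dataclass
class DubbingCondition:
    style: str       # "dialogue" | "narration" | "monologue"
    age: str         # coarse speaker age, e.g. "adult"
    gender: str      # coarse speaker gender, e.g. "female"
    reasoning: str   # chain-of-thought trace produced by the MLLM

def reason_dubbing_conditions(video_frames: List[bytes], script: str) -> DubbingCondition:
    """Stage 1 (hypothetical): a multimodal LLM reasons over the frames and the
    script and returns structured dubbing conditions. A fixed example is
    returned here in place of a real model call."""
    return DubbingCondition(
        style="dialogue",
        age="adult",
        gender="female",
        reasoning="Two on-screen characters face each other and exchange lines ...",
    )

def synthesize_dubbing(script: str, cond: DubbingCondition, speech_prompt: bytes) -> bytes:
    """Stage 2 (hypothetical): a large speech generation model synthesizes the
    waveform conditioned on the script, the CoT-derived attributes, and a
    reference speech prompt. Returns placeholder audio bytes here."""
    _ = (script, cond, speech_prompt)
    return b""  # placeholder waveform

# Usage: conditions inferred in stage 1 guide the speech generator in stage 2.
cond = reason_dubbing_conditions(video_frames=[], script="You're late again.")
waveform = synthesize_dubbing("You're late again.", cond, speech_prompt=b"")
```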
Related papers
- MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing [12.954750400557344]
We introduce a multi-modal generative framework for movie dubbing.
It produces high-quality dubbing using large speech generation models, guided by multi-modal inputs.
Results show superior performance compared to state-of-the-art (SOTA) methods.
arXiv Detail & Related papers (2025-05-22T06:23:05Z)
- FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing [78.83988199306901]
Movie Dubbing aims to convert scripts into speech that aligns with the given movie clip in both temporal and emotional aspects.
Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality.
We propose FlowDubber, which achieves high-quality audio-visual sync and pronunciation by incorporating a large speech language model and dual contrastive aligning.
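FlowDubber's title points to flow matching for voice enhancement. As a generic illustration of that mechanism (not FlowDubber's specific model), the sketch below integrates a toy velocity field from Gaussian noise toward a target with Euler steps; the `velocity_field` function is a hypothetical stand-in for a trained, condition-aware network.

```python
# Generic sketch of flow-matching inference: Euler integration of a
# velocity field from noise (t = 0) toward the data (t = 1). The toy
# field below is an assumption standing in for a trained network.
import numpy as np

def velocity_field(x: np.ndarray, t: float, target: np.ndarray) -> np.ndarray:
    """Toy conditional velocity pointing from the current sample toward a target;
    a real model would predict this from semantic/acoustic conditions."""
    return (target - x) / max(1.0 - t, 1e-3)

def flow_matching_sample(target: np.ndarray, steps: int = 50, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    x = rng.normal(size=target.shape)      # start from Gaussian noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_field(x, t, target)  # Euler step along the flow
    return x

print(np.abs(flow_matching_sample(np.ones(8)) - 1.0).max())  # close to 0
```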
arXiv Detail & Related papers (2025-05-02T13:30:19Z)
- Towards Film-Making Production Dialogue, Narration, Monologue Adaptive Moving Dubbing Benchmarks [6.71206005420634]
We introduce Talking Adaptive Dubbing Benchmarks (TA-Dubbing), designed to improve film production by adapting to dialogue, narration, monologue, and actors in movie dubbing.
TA-Dubbing offers several key advantages: 1) Comprehensive Dimensions: TA-Dubbing covers a variety of dimensions of movie dubbing, incorporating metric evaluations for both movie understanding and speech generation.
arXiv Detail & Related papers (2025-04-30T02:36:18Z)
- VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models [43.1613638989795]
We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues.
This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals.
arXiv Detail & Related papers (2025-04-03T08:24:47Z)
- DeepAudio-V1:Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation [6.315946909350621]
We propose an end-to-end multi-modal generation framework that simultaneously produces speech and audio based on video and text conditions.
The proposed framework, DeepAudio, consists of a video-to-audio (V2A) module, a text-to-speech (TTS) module, and a dynamic mixture of modality fusion (MoF) module.
In our evaluation, the framework achieves results comparable to state-of-the-art models on video-audio and text-speech benchmarks.
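The summary names a dynamic mixture of modality fusion (MoF) module that combines video and text conditions. Below is a minimal sketch of one plausible reading of such a gate, assuming a data-dependent softmax weighting over per-modality embeddings; the actual DeepAudio-V1 design may differ.

```python
# Minimal sketch of a "mixture of modality fusion" gate: a data-dependent
# softmax weighting over video and text embeddings. This is an assumption
# about how such a module could work, not DeepAudio-V1's actual design.
import numpy as np

def mof_fuse(video_emb: np.ndarray, text_emb: np.ndarray, gate_w: np.ndarray) -> np.ndarray:
    """Fuse two modality embeddings of shape (d,) using a gate vector of shape (d,)."""
    feats = np.stack([video_emb, text_emb])   # (2, d)
    logits = feats @ gate_w                   # (2,) per-modality scores
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                  # softmax gate
    return weights[0] * video_emb + weights[1] * text_emb

rng = np.random.default_rng(0)
fused = mof_fuse(rng.normal(size=256), rng.normal(size=256), rng.normal(size=256))
print(fused.shape)  # (256,)
```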
arXiv Detail & Related papers (2025-03-28T09:29:08Z)
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systematic captioning framework, producing annotations across modalities for more than 27.1k hours of trailer videos.
Our dataset potentially paves the way for fine-grained training of large multimodal-language models.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- Multilingual Synopses of Movie Narratives: A Dataset for Vision-Language Story Understanding [19.544839928488972]
We construct a large-scale multilingual video story dataset named Multilingual Synopses of Movie Narratives (M-SYMON).
M-SYMON contains 13,166 movie summary videos from 7 languages, as well as manual annotation of fine-grained video-text correspondences for 101.5 hours of video.
Training on the human-annotated data from SyMoN outperforms the SOTA methods by 15.7 and 16.2 percentage points on Clip Accuracy and Sentence IoU scores, respectively.
arXiv Detail & Related papers (2024-06-18T22:44:50Z)
- MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation [43.35578187209748]
Foley audio faces significant challenges in the AI-generated content (AIGC) landscape.
Current text-to-audio technology relies on detailed and acoustically relevant textual descriptions.
We introduce the Multi-modal Image and Narrative Text Dubbing dataset (MINT).
MINT is designed to enhance mainstream dubbing tasks such as literary story audiobook dubbing and image/silent video dubbing.
arXiv Detail & Related papers (2024-06-15T10:47:36Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing [125.86266166482704]
We propose StyleDubber, which switches dubbing learning from the frame level to the phoneme level.
It contains three main components: (1) a multimodal style adaptor operating at the phoneme level to learn pronunciation style from the reference audio and generate intermediate representations informed by the facial emotion presented in the video; (2) an utterance-level style learning module, which guides both the mel-spectrogram decoding and the refining processes from the intermediate embeddings to improve the overall style expression; and (3) a phoneme-guided lip aligner to maintain lip sync.
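The third component, a phoneme-guided lip aligner, keeps speech timing consistent with lip motion. The toy sketch below shows one way such alignment could work, proportionally rescaling predicted phoneme durations to match the lip-active span of the clip; StyleDubber's actual aligner is learned and more sophisticated.

```python
# Toy sketch of duration-level lip alignment: rescale predicted phoneme
# durations so total speech length matches the lip-active duration measured
# from the video. An illustrative assumption, not StyleDubber's aligner.
from typing import List

def align_durations(phoneme_durations: List[float], lip_active_seconds: float) -> List[float]:
    """Proportionally rescale per-phoneme durations (seconds) to sum to lip_active_seconds."""
    total = sum(phoneme_durations)
    if total <= 0:
        raise ValueError("phoneme durations must sum to a positive value")
    scale = lip_active_seconds / total
    return [d * scale for d in phoneme_durations]

# e.g. predicted durations sum to 1.10 s but the character's lips move for 1.32 s
print(align_durations([0.12, 0.30, 0.08, 0.60], 1.32))
```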
arXiv Detail & Related papers (2024-02-20T01:28:34Z)
- Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
- Learning to Dub Movies via Hierarchical Prosody Models [167.6465354313349]
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference.
We propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modelling, which bridges the visual information to corresponding speech prosody from three aspects: lip, face, and scene.
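The hierarchy bridges three visual levels (lip, face, scene) to speech prosody. The sketch below illustrates one plausible mapping of each level to a coarse prosody control (speaking rate, pitch, energy); the cue names and scaling rules are assumptions for illustration, and the paper's learned model is far richer.

```python
# Illustrative sketch only: map the three visual levels named in the
# summary (lip, face, scene) to coarse prosody controls. The cue inputs
# and linear scaling rules are assumptions, not the paper's model.
def hierarchical_prosody(lip_motion: float, face_valence: float, scene_tempo: float) -> dict:
    """Derive coarse prosody controls from lip-, face-, and scene-level cues.

    lip_motion:   mean lip-opening speed (a proxy for speaking rate)
    face_valence: facial-emotion valence in [-1, 1]
    scene_tempo:  scene pacing score in [0, 1]
    """
    speaking_rate = 1.0 + 0.5 * lip_motion     # lip level   -> duration / rate
    pitch_shift = 2.0 * face_valence           # face level  -> pitch (semitones)
    energy_gain = 0.8 + 0.4 * scene_tempo      # scene level -> loudness / energy
    return {"speaking_rate": speaking_rate,
            "pitch_shift_semitones": pitch_shift,
            "energy_gain": energy_gain}

print(hierarchical_prosody(lip_motion=0.4, face_valence=-0.3, scene_tempo=0.7))
```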
arXiv Detail & Related papers (2022-12-08T03:29:04Z)
- V2C: Visual Voice Cloning [55.55301826567474]
We propose a new task named Visual Voice Cloning (V2C).
V2C seeks to convert a paragraph of text to speech with both the desired voice, specified by a reference audio, and the desired emotion, specified by a reference video.
Our dataset contains 10,217 animated movie clips covering a large variety of genres.
arXiv Detail & Related papers (2021-11-25T03:35:18Z)