Fugu-MT Paper Translation (Abstract): FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing

Paper overview: FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing

arxiv url: http://arxiv.org/abs/2505.01263v1
Date: Fri, 02 May 2025 13:30:19 GMT
Status: Translation complete
Last updated in system: 2025-05-05 17:21:20.0402
Title: FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing
Title (reference translation): FlowDubber: Movie Dubbing with LLM-based Semantic Learning and Flow Matching based Voice Enhancement
Authors: Gaoxiang Cong, Liang Li, Jiadong Pan, Zhedong Zhang, Amin Beheshti, Anton van den Hengel, Yuankai Qi, Qingming Huang
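Abstract summary: Movie dubbing aims to convert scripts into speech that aligns with a given movie clip in both temporal and emotional aspects. Existing methods focus on reducing the word error rate while neglecting lip-sync and acoustic quality. This work proposes FlowDubber, which achieves high-quality audio-visual synchronization and pronunciation by incorporating a large language model and dual contrastive aligning.
Reference score (independently computed attention score): 78.83988199306901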
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects while preserving the vocal timbre of a given brief reference audio. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. To address these issues, we propose a large language model (LLM) based flow matching architecture for dubbing, named FlowDubber, which achieves high-quality audio-visual sync and pronunciation by incorporating a large speech language model and dual contrastive aligning while achieving better acoustic quality via the proposed voice-enhanced flow matching than previous works. First, we introduce Qwen2.5 as the backbone of LLM to learn the in-context sequence from movie scripts and reference audio. Then, the proposed semantic-aware learning focuses on capturing LLM semantic knowledge at the phoneme level. Next, dual contrastive aligning (DCA) boosts mutual alignment with lip movement, reducing ambiguities where similar phonemes might be confused. Finally, the proposed Flow-based Voice Enhancing (FVE) improves acoustic quality in two aspects, which introduces an LLM-based acoustics flow matching guidance to strengthen clarity and uses affine style prior to enhance identity when recovering noise into mel-spectrograms via gradient vector field prediction. Extensive experiments demonstrate that our method outperforms several state-of-the-art methods on two primary benchmarks. The demos are available at https://galaxycong.github.io/LLM-Flow-Dubber/.
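Abstract (reference translation): Movie dubbing aims to convert scripts into speech that aligns with a given movie clip in both temporal and emotional aspects while preserving the vocal timbre of a brief reference audio. Existing methods focus primarily on reducing the word error rate while neglecting lip-sync and acoustic quality. To address these issues, the authors propose FlowDubber, a large language model (LLM) based flow matching architecture for dubbing. It achieves high-quality audio-visual synchronization and pronunciation by incorporating a large speech language model and dual contrastive aligning, and it achieves better acoustic quality than previous work through the proposed voice-enhanced flow matching. First, Qwen2.5 is introduced as the LLM backbone to learn the in-context sequence from movie scripts and reference audio. The proposed semantic-aware learning then focuses on capturing LLM semantic knowledge at the phoneme level. Next, dual contrastive aligning (DCA) boosts mutual alignment with lip movement, reducing ambiguities where similar phonemes might be confused. Finally, the proposed Flow-based Voice Enhancing (FVE) improves acoustic quality in two respects: it introduces LLM-based acoustic flow matching guidance to strengthen clarity, and it uses an affine style prior to enhance identity when recovering mel-spectrograms from noise via gradient vector field prediction. Extensive experiments show that the method outperforms several state-of-the-art methods on two primary benchmarks. Demos are available at https://galaxycong.github.io/LLM-Flow-Dubber/.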
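Illustrative note: the abstract describes recovering mel-spectrograms from noise via vector field prediction. The following is a minimal sketch of generic flow matching along a straight-line path, not the paper's FVE module. The function names, the 80-value "mel frame", and the oracle velocity field are all assumptions made for illustration; in the paper a learned network (with LLM-based guidance and a style prior) would predict the field instead.

```python
import random

def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; the regression target in flow matching."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predicted, x0, x1):
    """Mean squared error between a predicted velocity and the path velocity."""
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(target)

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
# Hypothetical stand-ins: one 80-bin "mel frame" and a Gaussian noise vector.
mel_frame = [rng.gauss(0.0, 1.0) for _ in range(80)]
noise = [rng.gauss(0.0, 1.0) for _ in range(80)]

def oracle_velocity(x, t):
    """Exact field for the straight path: (x1 - x) / (1 - t).
    An assumption standing in for a trained network's prediction."""
    return [(m - xi) / (1.0 - t) for m, xi in zip(mel_frame, x)]

recovered = euler_sample(oracle_velocity, noise)
max_error = max(abs(r - m) for r, m in zip(recovered, mel_frame))
```

With the oracle field, Euler integration transports the noise vector onto the target frame; a trained model would approximate this field without access to the target.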

Related papers