Fugu-MT Paper Translation (Abstract): Empowering Video Translation using Multimodal Large Language Models

Paper overview: Empowering Video Translation using Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2604.11283v1
Date: Mon, 13 Apr 2026 10:42:31 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:16.477839
Title: Empowering Video Translation using Multimodal Large Language Models
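Title (reference translation): Video Translation Using Multimodal Large Language Models
Authors: Bingzheng QU, Kehai Chen, Xuefeng Bai, Min Zhang
Abstract summary: Multimodal large language models (MLLMs) are playing an increasingly important role in video translation. With strong multimodal understanding, reasoning, and generation capabilities, MLLM-based video translation systems are overcoming the limitations of traditional cascaded pipelines. This paper presents a comprehensive overview of MLLM-based video translation.
Reference score (independently computed attention score): 33.07174302639983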
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent developments in video translation have further enhanced cross-lingual access to video content, with multimodal large language models (MLLMs) playing an increasingly important supporting role. With strong multimodal understanding, reasoning, and generation capabilities, MLLMs-based video translation systems are overcoming the limitations of traditional cascaded pipelines that separately handle automatic speech recognition, machine translation, text-to-speech and lip synchronization. These MLLM-powered approaches not only achieve competitive or superior translation quality, but also demonstrate stronger robustness in zero-shot settings and multi-speaker scenarios, while jointly modeling semantic fidelity, timing, speaker identity, and emotional consistency. However, despite the rapid progress of MLLMs and extensive surveys on general video-language understanding, a focused and systematic review of how MLLMs empower video translation tasks is still lacking. To fill this gap, we provide the first comprehensive overview of MLLMs-based video translation, organized around a three-role taxonomy: 1) Semantic Reasoner, which characterizes how MLLMs perform video understanding, temporal reasoning, and multimodal fusion; 2) Expressive Performer, which analyzes LLM-driven and LLM-augmented techniques for expressive, controllable speech generation; and 3) Visual Synthesizer, which examines different types of video generators for high-fidelity lip-sync and visual alignment. Finally, we discuss open challenges in video understanding, temporal modeling, and multimodal alignment, and outline promising future research directions for MLLMs-powered video translation.
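Abstract (reference translation): In recent years, advances in video translation have further enhanced cross-lingual access to video content, with multimodal large language models (MLLMs) playing an increasingly important supporting role. With strong multimodal understanding, reasoning, and generation capabilities, MLLM-based video translation systems are overcoming the limitations of traditional cascaded pipelines that separately handle automatic speech recognition, machine translation, text-to-speech synthesis, and lip synchronization. These MLLM-powered approaches not only achieve competitive or superior translation quality but also show stronger robustness in zero-shot settings and multi-speaker scenarios, while jointly modeling semantic fidelity, timing, speaker identity, and emotional consistency. However, despite the rapid progress of MLLMs and extensive surveys on general video-language understanding, a focused and systematic review of how MLLMs empower video translation tasks is still lacking. To fill this gap, we present the first comprehensive overview of MLLM-based video translation, organized around a three-role taxonomy: 1) Semantic Reasoner, which characterizes how MLLMs perform video understanding, temporal reasoning, and multimodal fusion; 2) Expressive Performer, which analyzes LLM-driven and LLM-augmented techniques for expressive, controllable speech generation; and 3) Visual Synthesizer, which examines different types of video generators for high-fidelity lip-sync and visual alignment. Finally, we discuss open challenges in video understanding, temporal modeling, and multimodal alignment, and outline future research directions for MLLM-based video translation.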

Related papers