Fugu-MT Paper Translation (Abstract): UniVideo: Unified Understanding, Generation, and Editing for Videos

Paper summary: UniVideo: Unified Understanding, Generation, and Editing for Videos

arxiv url: http://arxiv.org/abs/2510.08377v1
Date: Thu, 09 Oct 2025 16:01:30 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:15.179983
Title: UniVideo: Unified Understanding, Generation, and Editing for Videos
Title (reference translation): UniVideo: Unified Understanding, Generation, and Editing of Videos
Authors: Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen
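Abstract summary: We present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm. We show that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation, and in-context video editing.
Reference score (independently computed attention score): 60.90505182401494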
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
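Abstract (reference translation): Unified multimodal models have shown promising results in multimodal content generation and editing, but they remain largely limited to the image domain. This paper introduces UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design that combines a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design allows complex multimodal instructions to be interpreted accurately while visual consistency is preserved. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is trained jointly across them. Extensive experiments show that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation, and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting and handles unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, in which the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, the authors will release the model and code.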

Related papers