Fugu-MT Paper Translation (Summary): Bernini: Latent Semantic Planning for Video Diffusion

Paper summary: Bernini: Latent Semantic Planning for Video Diffusion

arxiv url: http://arxiv.org/abs/2605.22344v1
Date: Thu, 21 May 2026 11:30:29 GMT
Status: Translation complete
Last updated in system: 2026-05-22 16:35:42.231286
Title: Bernini: Latent Semantic Planning for Video Diffusion
Title (reference translation): Bernini: Latent Semantic Planning for Video Diffusion
Authors: Bernini Team, Chenchen Liu, Junyi Chen, Lei Li, Lu Chi, Mingzhen Sun, Zhuoying Li, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan
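Abstract summary: This paper proposes Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space. Bernini achieves state-of-the-art performance across a wide range of video generation and editing benchmarks.
Reference score (independently computed attention score): 28.951773363020077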
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize images and videos with photorealistic fidelity. We argue that these two families can be unified through a simple division of labor: MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on this plan, augmented by text features and, for editing, source VAE features for detail preservation. Because semantics serve as the interface, the planner and renderer can be trained separately and only lightly co-trained, preserving the pretrained strengths of both components while keeping training efficient. To better handle multiple visual inputs, we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE), and further incorporate chain-of-thought reasoning in the planner to better transfer understanding into generation. Bernini achieves state-of-the-art performance across a wide range of video generation and editing benchmarks, with the MLLM's pretrained understanding translating into strong generalization on challenging editing tasks.
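Abstract (reference translation): MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize images and videos with photorealistic fidelity. MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on this plan, augmented by text features and, for editing, source VAE features for detail preservation. Because semantics serve as the interface, the planner and renderer can be trained separately, preserving the pretrained strengths of both components while keeping training efficient. To better handle multiple visual inputs, we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE) and further incorporate chain-of-thought reasoning in the planner to transfer understanding into generation. Bernini achieves state-of-the-art performance across a wide range of video generation and editing benchmarks, with the MLLM's pretrained understanding translating into strong generalization on challenging editing tasks.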
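
The division of labor described in the abstract (a planner produces a semantic plan, and a renderer is conditioned on that plan plus text features and, for editing only, source VAE features) can be illustrated with a small data-flow sketch. This is not the authors' implementation: the names `plan_semantics`, `build_conditioning`, and `Conditioning`, the token counts, the feature dimension, and the choice to simply concatenate the inputs are all assumptions made for illustration, and the planner here returns deterministic placeholder vectors rather than real ViT-space embeddings.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]


@dataclass
class Conditioning:
    """Hypothetical container for everything handed to the renderer."""
    tokens: List[Vector]
    segments: List[str]  # source of each token: "plan", "text", or "source_vae"


def plan_semantics(instruction: str, num_tokens: int = 4, dim: int = 8) -> List[Vector]:
    """Stand-in for the MLLM planner (assumption, not the real model).

    A real planner would predict target embeddings in ViT space; this one
    returns deterministic dummy vectors derived from the instruction length.
    """
    seed = len(instruction)
    return [
        [((seed + i * dim + j) % 7) / 7.0 for j in range(dim)]
        for i in range(num_tokens)
    ]


def build_conditioning(
    plan: List[Vector],
    text_features: List[Vector],
    source_vae: Optional[List[Vector]] = None,
) -> Conditioning:
    """Collect the renderer's inputs, tagging each token with its source.

    `source_vae` is passed only for editing; generation leaves it as None.
    Concatenation is an illustrative choice, not a claim about the paper.
    """
    tokens: List[Vector] = []
    segments: List[str] = []
    for name, feats in (("plan", plan), ("text", text_features), ("source_vae", source_vae)):
        if feats is None:
            continue  # skip inputs that do not apply to this task
        tokens.extend(feats)
        segments.extend([name] * len(feats))
    return Conditioning(tokens=tokens, segments=segments)


if __name__ == "__main__":
    plan = plan_semantics("make the sky purple")
    text = [[0.0] * 8 for _ in range(3)]

    generation = build_conditioning(plan, text)
    editing = build_conditioning(plan, text, source_vae=[[1.0] * 8 for _ in range(2)])

    print(generation.segments)
    print(editing.segments)
```

In this sketch the only difference between generation and editing is whether `source_vae` is supplied, which mirrors the abstract's statement that source VAE features are added for editing to preserve detail.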

Related papers