Fugu-MT Paper Translation (Abstract): Lighting-grounded Video Generation with Renderer-based Agent Reasoning

Paper overview: Lighting-grounded Video Generation with Renderer-based Agent Reasoning

arxiv url: http://arxiv.org/abs/2604.07966v1
Date: Thu, 09 Apr 2026 08:29:29 GMT
Status: Translation complete
System update date: 2026-04-10 18:34:05.80327
Title: Lighting-grounded Video Generation with Renderer-based Agent Reasoning
Title (reference translation): Lighting-grounded video generation using Renderer-based Agent Reasoning
Authors: Ziqi Cai, Taoyu Yang, Zheng Chang, Si Li, Han Jiang, Shuchen Weng, Boxin Shi
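Abstract summary: LiVER is a diffusion-based framework for scene-controllable video generation. The paper proposes a new framework that conditions video synthesis on explicit 3D scene properties. The method disentangles these properties by rendering control signals from a unified 3D representation.
Reference score (independently computed attention score): 56.50946217758078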
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting their applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation. To achieve this, we introduce a novel framework that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.
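Abstract (reference translation): Diffusion models have made remarkable progress in video generation, but controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, which limits applicability in domains such as filmmaking and virtual production, where explicit scene control is essential. This paper introduces LiVER, a diffusion-based framework for scene-controllable video generation. To this end, it proposes a new framework that conditions video synthesis on explicit 3D scene properties, supported by a large-scale dataset with dense annotations of object layout, lighting, and camera parameters. The method disentangles these properties by rendering control signals from a unified 3D representation. The authors propose a lightweight conditioning module and a progressive training strategy that integrate these signals into a foundational video diffusion model while ensuring stable convergence and high fidelity. The framework enables a wide range of applications, including image-to-video and video-to-video synthesis in which the underlying 3D scene is fully editable. To further improve usability, the authors develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.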

Related papers