Fugu-MT Paper Translation (Abstract): X2Video: Adapting Diffusion Models for Multimodal Controllable Neural Video Rendering

Paper overview: X2Video: Adapting Diffusion Models for Multimodal Controllable Neural Video Rendering

arxiv url: http://arxiv.org/abs/2510.08530v1
Date: Thu, 09 Oct 2025 17:50:31 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:15.278414
Title: X2Video: Adapting Diffusion Models for Multimodal Controllable Neural Video Rendering
Title (reference translation): X2Video: Applying Diffusion Models to Multimodal Controllable Neural Video Rendering
Authors: Zhitong Huang, Mohan Zhang, Renhan Wang, Rui Tang, Hao Zhu, Jing Liao
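Abstract summary: X2Video is the first diffusion model for rendering photorealistic videos guided by intrinsic channels including albedo, normal, roughness, metallicity, and irradiance. It supports intuitive multi-modal controls with reference images and text prompts for both global and local regions. X2Video can produce long, temporally consistent, and photorealistic videos guided by intrinsic conditions.
Reference score (independently calculated attention score): 25.939894201559426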
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present X2Video, the first diffusion model for rendering photorealistic videos guided by intrinsic channels including albedo, normal, roughness, metallicity, and irradiance, while supporting intuitive multi-modal controls with reference images and text prompts for both global and local regions. The intrinsic guidance allows accurate manipulation of color, material, geometry, and lighting, while reference images and text prompts provide intuitive adjustments in the absence of intrinsic information. To enable these functionalities, we extend the intrinsic-guided image generation model XRGB to video generation by employing a novel and efficient Hybrid Self-Attention, which ensures temporal consistency across video frames and also enhances fidelity to reference images. We further develop a Masked Cross-Attention to disentangle global and local text prompts, applying them effectively onto respective local and global regions. For generating long videos, our novel Recursive Sampling method incorporates progressive frame sampling, combining keyframe prediction and frame interpolation to maintain long-range temporal consistency while preventing error accumulation. To support the training of X2Video, we assembled a video dataset named InteriorVideo, featuring 1,154 rooms from 295 interior scenes, complete with reliable ground-truth intrinsic channel sequences and smooth camera trajectories. Both qualitative and quantitative evaluations demonstrate that X2Video can produce long, temporally consistent, and photorealistic videos guided by intrinsic conditions. Additionally, X2Video effectively accommodates multi-modal controls with reference images, global and local text prompts, and simultaneously supports editing on color, material, geometry, and lighting through parametric tuning. Project page: https://luckyhzt.github.io/x2video
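Abstract (reference translation): We present X2Video, the first diffusion model that renders photorealistic videos guided by intrinsic channels including albedo, normal, roughness, metallicity, and irradiance, while supporting intuitive multi-modal control through reference images and text prompts. This intrinsic guidance enables accurate manipulation of color, material, geometry, and lighting, while reference images and text prompts provide intuitive adjustments when intrinsic information is absent. To realize these functions, we extend XRGB, an intrinsic-guided image generation model, to video generation using a novel and efficient Hybrid Self-Attention that ensures temporal consistency across video frames and improves fidelity to reference images. We further develop Masked Cross-Attention, which disentangles global and local text prompts and applies them effectively to the respective global and local regions. To generate long videos, we introduce progressive frame sampling that combines keyframe prediction and frame interpolation, maintaining long-range temporal consistency while preventing error accumulation. To support the training of X2Video, we built a video dataset containing 1,154 rooms from 295 interior scenes. Qualitative and quantitative evaluations show that X2Video produces long, temporally consistent, photorealistic videos guided by intrinsic conditions. In addition, X2Video effectively supports multi-modal control with reference images and global and local text prompts, and simultaneously supports editing of color, material, geometry, and lighting through parametric tuning. Project page: https://luckyhzt.github.io/x2video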
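
The abstract describes long-video generation as keyframe prediction followed by frame interpolation. The sketch below is only an illustration of that coarse-to-fine idea, not the paper's Recursive Sampling algorithm: the function name `progressive_frame_order`, the `stride` parameter, and the midpoint-halving rule are all assumptions made for this example. It computes only the order in which frame indices would be scheduled and runs no diffusion model.

```python
def progressive_frame_order(num_frames: int, stride: int) -> list[list[int]]:
    """Group frame indices by sampling pass, coarsest pass first.

    Hypothetical schedule (an assumption, not taken from the paper):
    pass 0 holds the keyframes, i.e. every `stride`-th frame plus the last
    frame. Each later pass adds the midpoint between every pair of adjacent
    frames already scheduled, until no gaps remain.
    """
    if num_frames < 1 or stride < 1:
        raise ValueError("num_frames and stride must be positive")

    # Keyframes: sparse frames spanning the whole clip.
    keyframes = sorted(set(range(0, num_frames, stride)) | {num_frames - 1})
    passes = [keyframes]
    known = keyframes

    while True:
        # Interpolation targets: midpoints of gaps wider than one frame.
        mids = [(a + b) // 2 for a, b in zip(known, known[1:]) if b - a > 1]
        if not mids:
            break
        passes.append(mids)
        known = sorted(known + mids)

    return passes


if __name__ == "__main__":
    for i, frames in enumerate(progressive_frame_order(9, 4)):
        print(f"pass {i}: {frames}")
```

In a schedule of this kind, every interpolated frame is bounded by two frames that were already generated, which is one plausible way to keep errors from compounding along the sequence.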

Related papers