Fugu-MT Paper Translation (Abstract): Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

Paper overview: Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

arxiv url: http://arxiv.org/abs/2606.13655v2
Date: Sat, 13 Jun 2026 07:52:41 GMT
Status: Translation complete
System update date: 2026-06-16 13:45:31.216342
Title: Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction
Title (reference translation): Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction
Authors: Jen-Hao Cheng, Yipeng Wang, Hao Zhang, Gengshan Yang, Jenq-Neng Hwang
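Abstract summary: We propose Flex4DHuman, a multi-view diffusion model that converts a monocular or sparse multi-view video of a dynamic subject into dense multi-view videos. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors.
Reference score (independently calculated attention score): 32.69813032650073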
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, while the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.
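Abstract (reference translation): This paper presents Flex4DHuman, a video diffusion model. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be ingested directly by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, the model is trained with clean historical target-view tokens. Multi-view captions are also added to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, the framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses state-of-the-art methods, and the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.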
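
Note on relative camera poses: the abstract states that generation is conditioned on SE(3) relative camera geometry, but it does not give a formula or a pose convention. The following is a minimal illustrative sketch, not the authors' implementation. It assumes each camera is described by a 4x4 camera-to-world matrix, and all function names (`matmul`, `invert_rigid`, `relative_pose`, `pose_y`, `allclose`) are hypothetical. It shows why a relative pose is a convenient conditioning signal: it does not change when the whole scene is moved by a global rigid transform.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(T):
    """Invert a rigid transform [R t; 0 1] as [R^T, -R^T t; 0 1]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def relative_pose(ref_c2w, tgt_c2w):
    """Transform taking target-camera coordinates into the reference camera frame.

    Assumption: both inputs are camera-to-world matrices.
    """
    return matmul(invert_rigid(ref_c2w), tgt_c2w)

def pose_y(theta, t):
    """Hypothetical helper: camera-to-world pose rotated by theta about the y axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s, t[0]],
            [0.0, 1.0, 0.0, t[1]],
            [-s, 0.0, c, t[2]],
            [0.0, 0.0, 0.0, 1.0]]

def allclose(a, b, tol=1e-9):
    """Element-wise comparison of two 4x4 matrices."""
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(4) for j in range(4))

ref = pose_y(0.3, [1.0, 0.0, 2.0])
tgt = pose_y(1.1, [-0.5, 0.2, 3.0])
rel = relative_pose(ref, tgt)

# Moving both cameras by the same global transform leaves the relative pose unchanged.
G = pose_y(-0.7, [4.0, -1.0, 0.5])
rel_moved = relative_pose(matmul(G, ref), matmul(G, tgt))
```

Here `rel` and `rel_moved` agree up to floating-point error, and composing `ref` with `rel` recovers `tgt`. How the paper turns such a transform into a positional encoding is not specified in the abstract and is not modeled here.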
