Fugu-MT Paper Translation (Abstract): Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization

Paper abstract: Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization

arxiv url: http://arxiv.org/abs/2606.02000v1
Date: Mon, 01 Jun 2026 09:56:48 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:31.773353
Title: Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization
Title (reference translation): Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization
Authors: Jingyun Liang, Min Wei, Shikai Li, Yizeng Han, Hangjie Yuan, Lei Sun, Weihua Chen, Fan Wang
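Abstract summary: We propose a render-free framework that conditions video generation directly on compressed 3D human mesh tokens. This representation preserves full 3D geometric information while enabling a unified token-based generation pipeline. Experiments show strong performance on human motion control benchmarks while reducing artifacts induced by view-dependent 2D guidance and trajectory-pose mismatches.
Reference score (independently computed attention score): 41.562044736774816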
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underlying visual observations, rather than simply reproducing plausible 2D projections, remains an open question. In this work, we investigate this question through human motion control, a task that requires precise modelling of 3D human geometry, motion, camera viewpoint, and scene context. Unlike prior methods that rely on rendered 2D motion guidance videos, we propose a render-free framework that conditions video generation directly on compressed 3D human mesh tokens. This representation preserves full 3D geometric information while enabling a unified token-based generation pipeline that processes video tokens jointly with motion tokens in a DiT-based architecture. This design requires the model to reason jointly about appearance, 3D structure, and camera viewpoint during video generation. Experimental results demonstrate strong performance on human motion control benchmarks, while reducing artifacts induced by view-dependent 2D guidance and trajectory-pose mismatches during editing. These findings suggest that video diffusion models, when equipped with mesh tokenization, can better capture complex 3D human structures and their interactions with the surrounding environment.
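Abstract (reference translation): Diffusion models have achieved remarkable success in video generation. However, it remains unresolved whether such models truly perceive the 3D structure underlying visual observations rather than merely reproducing plausible 2D projections. In this study, we examine this question through human motion control, a task that requires accurately modeling 3D human geometry, motion, camera viewpoint, and scene context. Unlike prior methods that depend on rendered 2D motion guidance videos, we propose a render-free framework that conditions video generation directly on compressed 3D human mesh tokens. This representation preserves full 3D geometric information while enabling a unified token-based generation pipeline that processes video tokens jointly with motion tokens in a DiT-based architecture. This design requires the model to reason jointly about appearance, 3D structure, and camera viewpoint during video generation. Experiments show strong performance on human motion control benchmarks while reducing artifacts induced by view-dependent 2D guidance and trajectory-pose mismatches during editing. These results suggest that video diffusion models equipped with mesh tokenization can better capture complex 3D human structures and their interactions with the surrounding environment.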
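
The abstract states that video tokens are processed jointly with motion tokens in a DiT-based architecture. One common way to realize joint processing is to concatenate both token sets into a single sequence and apply self-attention over it, so that every video token can attend to every mesh token. The following is a minimal sketch of that idea only, not the authors' implementation: the function names, the token values, the single attention head, and the use of identity query/key/value projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(video_tokens, motion_tokens):
    """Single-head self-attention over one concatenated token sequence.

    Assumption: identity projections stand in for the learned
    query/key/value weights a real DiT block would use.
    """
    tokens = video_tokens + motion_tokens  # one shared sequence
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    n_video = len(video_tokens)
    # Split the result back into its video and motion parts.
    return outputs[:n_video], outputs[n_video:]


# Hypothetical toy inputs: 3 video latent tokens and 2 mesh (motion) tokens.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
motion = [[0.5, -0.5], [2.0, 0.0]]
video_out, motion_out = joint_self_attention(video, motion)
```

Because the attention runs over the concatenated sequence, changing the motion tokens changes the updated video tokens, which is the sense in which the motion signal conditions generation in this sketch. No rendered 2D guidance video appears anywhere in the computation.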

Related papers