Fugu-MT Paper Translation (Abstract): Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints

Paper summary: Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints

arxiv url: http://arxiv.org/abs/2603.11755v1
Date: Thu, 12 Mar 2026 10:02:23 GMT
Status: Translation complete
System update date: 2026-03-13 14:46:26.008168
Title: Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints
Title (reference translation): Controlling Egocentric Video with Occlusion-Aware Sparse 3D Hand Joints
Authors: Chenyangguang Zhang, Botao Ye, Boqi Chen, Alexandros Delitzas, Fangjinhua Wang, Marc Pollefeys, Xi Wang
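Abstract summary: Motion-controllable video generation is essential for egocentric applications in virtual reality and embodied AI. Existing methods often struggle to achieve 3D-consistent, fine-grained hand articulation. The authors propose a new framework that generates egocentric video from a single reference frame.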
Reference score (independently computed attention level): 87.13154261503168
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Motion-controllable video generation is crucial for egocentric applications in virtual reality and embodied AI. However, existing methods often struggle to achieve 3D-consistent, fine-grained hand articulation. By relying on 2D trajectories or implicit poses, they collapse 3D geometry into spatially ambiguous signals or over-rely on human-centric priors. Under severe egocentric occlusions, this causes motion inconsistencies and hallucinated artifacts, and it prevents cross-embodiment generalization to robotic hands. To address these limitations, we propose a novel framework that generates egocentric videos from a single reference frame, leveraging sparse 3D hand joints as embodiment-agnostic control signals with clear semantic and geometric structures. We introduce an efficient control module that resolves occlusion ambiguities while fully preserving 3D information. Specifically, it extracts occlusion-aware features from the source reference frame by penalizing unreliable visual signals from hidden joints, and employs a 3D-based weighting mechanism to robustly handle dynamically occluded target joints during motion propagation. Concurrently, the module directly injects 3D geometric embeddings into the latent space to strictly enforce structural consistency. To facilitate robust training and evaluation, we develop an automated annotation pipeline that yields over one million high-quality egocentric video clips paired with precise hand trajectories. Additionally, we register humanoid kinematic and camera data to construct a cross-embodiment benchmark. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, generating high-fidelity egocentric videos with realistic interactions and exhibiting exceptional cross-embodiment generalization to robotic hands.
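Abstract (reference translation): Motion-controllable video generation is essential for egocentric applications in virtual reality and embodied AI. However, existing methods often struggle to achieve 3D-consistent, fine-grained hand articulation. By adopting 2D trajectories or implicit poses, they either reduce 3D geometry to spatially ambiguous signals or depend on human-centric priors. Under severe egocentric occlusion, this produces motion inconsistencies and hallucinated artifacts and prevents cross-embodiment generalization to robotic hands. To address these limitations, we propose a new framework that generates egocentric video from a single reference frame. We introduce an efficient control module that resolves occlusion ambiguities while fully preserving 3D information. Specifically, it extracts occlusion-aware features from the source reference frame by penalizing unreliable visual signals from hidden joints, and uses a 3D-based weighting mechanism to robustly handle dynamically occluded target joints during motion propagation. At the same time, the module injects 3D geometric embeddings directly into the latent space to strictly enforce structural consistency. To facilitate robust training and evaluation, we develop an automated annotation pipeline that yields over one million high-quality egocentric video clips paired with precise hand trajectories. In addition, we register humanoid kinematic and camera data to build a cross-embodiment benchmark. Extensive experiments show that our approach significantly outperforms state-of-the-art baselines, generating high-fidelity egocentric videos with realistic interactions and exhibiting exceptional cross-embodiment generalization to robotic hands.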


Related papers