Fugu-MT Paper Translation (Abstract): MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons

Paper summary: MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons

arxiv url: http://arxiv.org/abs/2604.28130v1
Date: Thu, 30 Apr 2026 17:16:38 GMT
Status: Translation complete
Last updated in system: 2026-05-01 16:31:54.219517
Title: MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons
Title (reference translation): MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons
Authors: Kehong Gong, Zhengyu Wen, Dao Thien Phong, Mingxi Xu, Weixia He, Qi Wang, Ning Zhang, Zhengyu Li, Guanli Hou, Dongze Lian, Xiaoyu He, Mingyuan Zhang, Hanwang Zhang
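Abstract summary: This paper proposes the first end-to-end framework in which Video-to-Pose and Pose-to-Rotation are jointly learned and optimized. The method reduces rotation error from about 17 degrees to about 10 degrees, and to 6.54 degrees on unseen skeletons, while running about 20x faster than mesh-based pipelines.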
Reference score (attention score computed independently by this site): 56.68975315643491
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, where a Video-to-Pose network predicts joint positions and an analytical inverse-kinematics (IK) stage recovers joint rotations. While effective, this design is inherently limited, since joint positions do not fully determine rotations and leave degrees of freedom such as bone-axis twist ambiguous, and the non-differentiable IK stage prevents the system from adapting to noisy predictions or optimizing for the final animation objective. In this work, we present the first fully end-to-end framework in which both Video-to-Pose and Pose-to-Rotation are learnable and jointly optimized. We observe that the ambiguity in pose-to-rotation mapping arises from missing coordinate system information: the same joint positions can correspond to different rotations under different rest poses and local axis conventions. To resolve this, we introduce a reference pose-rotation pair from the target asset, which, together with the rest pose, not only anchors the mapping but also defines the underlying rotation coordinate system. This formulation turns rotation prediction into a well-constrained conditional problem and enables effective learning. In addition, our model predicts joint positions directly from video without relying on mesh intermediates, improving both robustness and efficiency. Both stages share a skeleton-aware Global-Local Graph-guided Multi-Head Attention (GL-GMHA) module for joint-level local reasoning and global coordination. Experiments on Truebones Zoo and Objaverse show that our method reduces rotation error from ~17 degrees to ~10 degrees, and to 6.54 degrees on unseen skeletons, while achieving ~20x faster inference than mesh-based pipelines. Project page: https://animotionlab.github.io/MoCapAnythingV2/
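Abstract (reference translation): Recent approaches to arbitrary-skeleton motion capture from monocular video follow a factorized pipeline: a Video-to-Pose network predicts joint positions, and an analytical inverse-kinematics (IK) stage recovers joint rotations. This design works but is inherently limited. Joint positions do not fully determine rotations, so degrees of freedom such as bone-axis twist remain ambiguous, and the non-differentiable IK stage keeps the system from adapting to noisy predictions or optimizing for the final animation objective. This work presents the first fully end-to-end framework in which both Video-to-Pose and Pose-to-Rotation are learnable and jointly optimized. The ambiguity in the pose-to-rotation mapping stems from missing coordinate-system information: the same joint positions can correspond to different rotations under different rest poses and local axis conventions. To resolve this, the authors introduce a reference pose-rotation pair from the target asset which, together with the rest pose, both anchors the mapping and defines the underlying rotation coordinate system. This formulation turns rotation prediction into a well-constrained conditional problem and enables effective learning. The model also predicts joint positions directly from video without relying on mesh intermediates, which improves both robustness and efficiency. Both stages share a skeleton-aware Global-Local Graph-guided Multi-Head Attention (GL-GMHA) module for joint-level local reasoning and global coordination. Experiments on Truebones Zoo and Objaverse show that the method reduces rotation error from about 17 degrees to about 10 degrees, and to 6.54 degrees on unseen skeletons, with inference about 20x faster than mesh-based pipelines. Project page: https://animotionlab.github.io/MoCapAnythingV2/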
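
Note on the twist ambiguity: the abstract states that joint positions leave bone-axis twist undetermined. The sketch below is an illustration only, not code from the paper. It assumes a single hypothetical bone whose child joint sits one unit along +Y in the rest pose, and it measures rotation error as the geodesic angle between two rotation matrices, which is an assumed metric; the abstract does not specify how its rotation error in degrees is computed. Two rotations that differ only by a twist about the bone axis place the child joint at the same position, yet they are 40 degrees apart as rotations.

```python
import math


def axis_angle_to_matrix(axis, angle_deg):
    """Rodrigues' formula: 3x3 matrix rotating by angle_deg about axis."""
    x, y, z = axis
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    t = 1.0 - c
    return [
        [t * x * x + c, t * x * y - s * z, t * x * z + s * y],
        [t * x * y + s * z, t * y * y + c, t * y * z - s * x],
        [t * x * z - s * y, t * y * z + s * x, t * z * z + c],
    ]


def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply(r, v):
    """Apply a 3x3 rotation matrix to a 3-vector."""
    return [sum(r[i][k] * v[k] for k in range(3)) for i in range(3)]


def geodesic_error_deg(r1, r2):
    """Angle (degrees) of the relative rotation r1^T r2."""
    # trace(r1^T r2) equals the element-wise (Frobenius) inner product.
    trace = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


# Hypothetical rest pose: child joint one unit along +Y from its parent.
rest_offset = [0.0, 1.0, 0.0]

# Rotation A: swing the bone 90 degrees about Z.
rot_a = axis_angle_to_matrix([0.0, 0.0, 1.0], 90.0)
# Rotation B: first twist 40 degrees about the bone's own axis (Y), then the same swing.
rot_b = matmul(rot_a, axis_angle_to_matrix([0.0, 1.0, 0.0], 40.0))

child_a = apply(rot_a, rest_offset)
child_b = apply(rot_b, rest_offset)

position_gap = math.dist(child_a, child_b)      # distance between child joints
rotation_gap = geodesic_error_deg(rot_a, rot_b)  # angular difference in degrees

print(f"child position gap: {position_gap:.6f}")
print(f"rotation gap (deg): {rotation_gap:.2f}")
```

Because the twist is about the same axis the bone points along, it cannot be recovered from the child position alone. This is the kind of missing information that, according to the abstract, the reference pose-rotation pair and rest pose are meant to supply.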


Related papers