MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
- URL: http://arxiv.org/abs/2512.10881v1
- Date: Thu, 11 Dec 2025 18:09:48 GMT
- Title: MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
- Authors: Kehong Gong, Zhengyu Wen, Weixia He, Mingxi Xu, Qi Wang, Ning Zhang, Zhengyu Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang,
- Abstract summary: MoCapAnything is a reference-guided, factorized framework for 3D motion capture. It reconstructs a rotation-based animation that directly drives the specific asset. It delivers high-quality skeletal animations and meaningful cross-species animations.
- Score: 31.168481928653748
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/
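The factorized design described in the abstract (predict 3D joint trajectories first, then recover asset-specific rotations) can be pictured with a minimal sketch. The class names, tensor shapes, and the gradient-descent fit below are illustrative assumptions based only on the abstract, not the authors' released code or API; a plain forward-kinematics optimization stands in for the paper's constraint-aware IK stage.

```python
import torch
import torch.nn as nn


def hat(k: torch.Tensor) -> torch.Tensor:
    """Skew-symmetric (cross-product) matrix of k: (..., 3) -> (..., 3, 3)."""
    zero = torch.zeros_like(k[..., 0])
    return torch.stack([
        torch.stack([zero, -k[..., 2], k[..., 1]], dim=-1),
        torch.stack([k[..., 2], zero, -k[..., 0]], dim=-1),
        torch.stack([-k[..., 1], k[..., 0], zero], dim=-1),
    ], dim=-2)


def axis_angle_to_matrix(aa: torch.Tensor) -> torch.Tensor:
    """Rodrigues' formula: axis-angle (..., 3) -> rotation matrices (..., 3, 3)."""
    theta = aa.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    K = hat(aa / theta)
    theta = theta.unsqueeze(-1)
    eye = torch.eye(3).expand_as(K)
    return eye + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)


def forward_kinematics(angles, parents, offsets):
    """Joint positions from per-joint axis-angle rotations.
    angles: (T, J, 3); parents[j] < j with parents[0] == -1; offsets: (J, 3)."""
    T, J, _ = angles.shape
    R_local = axis_angle_to_matrix(angles)                        # (T, J, 3, 3)
    pos, R_glob = [torch.zeros(T, 3)], [R_local[:, 0]]
    for j in range(1, J):
        p = parents[j]
        R_glob.append(R_glob[p] @ R_local[:, j])
        pos.append(pos[p] + (R_glob[p] @ offsets[j].view(3, 1)).squeeze(-1))
    return torch.stack(pos, dim=1)                                 # (T, J, 3)


class ReferencePromptEncoder(nn.Module):
    """Maps pooled asset descriptors (skeleton/mesh/render features) to per-joint queries."""
    def __init__(self, num_joints: int, dim: int = 256):
        super().__init__()
        self.joint_queries = nn.Embedding(num_joints, dim)         # learned query per joint
        self.asset_proj = nn.Linear(dim, dim)                      # placeholder asset fusion

    def forward(self, asset_feats: torch.Tensor) -> torch.Tensor:
        # asset_feats: (J, dim) -> per-joint queries (J, dim)
        return self.joint_queries.weight + self.asset_proj(asset_feats)


class UnifiedMotionDecoder(nn.Module):
    """Cross-attends joint queries to per-frame video features, predicts 3D trajectories."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, 3)

    def forward(self, joint_queries, video_feats):
        # joint_queries: (J, dim); video_feats: (T, N, dim) dense per-frame descriptors
        q = joint_queries.unsqueeze(0).expand(video_feats.shape[0], -1, -1)
        fused, _ = self.attn(q, video_feats, video_feats)           # (T, J, dim)
        return self.head(fused)                                     # (T, J, 3) trajectories


def constraint_aware_ik(trajectories, parents, offsets, iters=200, lr=1e-2):
    """Stage two: fit per-joint rotations so FK matches the predicted trajectories.
    Rig-specific constraints (joint limits, locked axes) would enter as penalty terms."""
    T, J, _ = trajectories.shape
    angles = (0.01 * torch.randn(T, J, 3)).requires_grad_(True)     # small init avoids theta = 0
    opt = torch.optim.Adam([angles], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = ((forward_kinematics(angles, parents, offsets) - trajectories) ** 2).mean()
        loss.backward()
        opt.step()
    return angles.detach()                                           # per-joint rotations for BVH export
```

The point of the factorization is visible in the sketch: the decoder only ever predicts skeleton-agnostic 3D trajectories, so everything rig-specific (joint hierarchy, bone offsets, joint limits) is confined to the lightweight IK stage that produces the final rotation-based animation.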
Related papers
- Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation [32.57062686780495]
Superman is a unified framework that bridges visual perception with temporal, skeleton-based motion generation. This module flexibly processes diverse temporal inputs, unifying 3D skeleton pose estimation from video (perception) with skeleton-based motion prediction and in-betweening (generation).
arXiv Detail & Related papers (2026-02-02T17:59:01Z) - DIMO: Diverse 3D Motion Generation for Arbitrary Objects [57.14954351767432]
DIMO is a generative approach capable of generating diverse 3D motions for arbitrary objects from a single image. We leverage the rich priors in well-trained video models to extract common motion patterns. At inference time, we can instantly sample diverse 3D motions from the learned latent space in a single forward pass.
arXiv Detail & Related papers (2025-11-10T18:56:49Z) - Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video [56.781766315691854]
We introduce Restage4D, a geometry-preserving pipeline for video-conditioned 4D restaging. We validate Restage4D on DAVIS and PointOdyssey, demonstrating improved geometry consistency, motion quality, and 3D tracking performance.
arXiv Detail & Related papers (2025-08-08T21:31:51Z) - AnimaX: Animating the Inanimate in 3D with Joint Video-Pose Diffusion Models [24.410731608387238]
AnimaX is a feed-forward 3D animation framework that bridges the motion priors of video diffusion models with the controllable structure of skeleton-based animation. Our method represents 3D motion as multi-view, multi-frame 2D pose maps, and enables joint video-pose diffusion conditioned on template renderings.
arXiv Detail & Related papers (2025-06-24T17:59:58Z) - SMF: Template-free and Rig-free Animation Transfer using Kinetic Codes [32.324844649352166]
Animation retargeting applies a sparse motion description to a character mesh to produce a semantically plausible and temporally coherent full-body sequence. We propose Self-supervised Motion Fields (SMF), a self-supervised framework that is trained with only sparse motion representations. Our architecture comprises dedicated spatial and temporal gradient predictors, which are jointly trained in an end-to-end fashion.
arXiv Detail & Related papers (2025-04-07T08:42:52Z) - Recovering Dynamic 3D Sketches from Videos [30.87733869892925]
Liv3Stroke is a novel approach for abstracting objects in motion with deformable 3D strokes. We first extract noisy 3D point cloud motion guidance from video frames using semantic features. Our approach deforms a set of curves to abstract essential motion features as a set of explicit 3D representations.
arXiv Detail & Related papers (2025-03-26T08:43:21Z) - Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics [79.4785166021062]
We introduce Puppet-Master, an interactive video generator that captures the internal, part-level motion of objects. We demonstrate that Puppet-Master learns to generate part-level motions, unlike other motion-conditioned video generators. Puppet-Master generalizes well to out-of-domain real images, outperforming existing methods on real-world benchmarks.
arXiv Detail & Related papers (2024-08-08T17:59:38Z) - Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos [47.97168047776216]
We introduce a new method for learning a generative model of articulated 3D animal motions from raw, unlabeled online videos.
Our model learns purely from a collection of unlabeled web video clips, leveraging semantic correspondences distilled from self-supervised image features.
arXiv Detail & Related papers (2023-12-21T06:44:18Z) - Reconstructing Animatable Categories from Videos [65.14948977749269]
Building animatable 3D models is challenging due to the need for 3D scans, laborious registration, and manual rigging.
We present RAC, which builds category-level 3D models from monocular videos while disentangling variations over instances and motion over time.
We show that 3D models of humans, cats, and dogs can be learned from 50-100 internet videos.
arXiv Detail & Related papers (2023-05-10T17:56:21Z) - MoCaNet: Motion Retargeting in-the-wild via Canonicalization Networks [77.56526918859345]
We present a novel framework that brings the 3D motion retargeting task from controlled environments to in-the-wild scenarios.
It is capable of retargeting body motion from a character in a 2D monocular video to a 3D character without using any motion capture system or 3D reconstruction procedure.
arXiv Detail & Related papers (2021-12-19T07:52:05Z) - Video Autoencoder: self-supervised disentanglement of static 3D structure and motion [60.58836145375273]
A video autoencoder is proposed for learning disentangled representations of 3D structure and camera pose from videos.
The representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following.
arXiv Detail & Related papers (2021-10-06T17:57:42Z)