Learning to Track Any Points from Human Motion
- URL: http://arxiv.org/abs/2507.06233v1
- Date: Tue, 08 Jul 2025 17:59:58 GMT
- Title: Learning to Track Any Points from Human Motion
- Authors: Inès Hyeonsu Kim, Seokju Cho, Jahyeok Koo, Junghyun Park, Jiahui Huang, Joon-Young Lee, Seungryong Kim
- Abstract summary: We propose an automated pipeline to generate pseudo-labeled training data for point tracking. A point tracking model trained on AnthroTAP achieves state-of-the-art performance on the TAP-Vid benchmark.
- Score: 55.831218129679144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human motion, with its inherent complexities, such as non-rigid deformations, articulated movements, clothing distortions, and frequent occlusions caused by limbs or other individuals, provides a rich and challenging source of supervision that is crucial for training robust and generalizable point trackers. Despite this suitability, acquiring extensive training data for point tracking remains difficult due to laborious manual annotation. We address this with AnthroTAP, an automated pipeline that generates pseudo-labeled training data by leveraging the Skinned Multi-Person Linear (SMPL) model. We first fit the SMPL model to detected humans in video frames, project the resulting 3D mesh vertices onto 2D image planes to generate pseudo-trajectories, handle occlusions using ray-casting, and filter out unreliable tracks based on optical flow consistency. A point tracking model trained on the AnthroTAP-annotated dataset achieves state-of-the-art performance on the TAP-Vid benchmark, surpassing other models trained on real videos while using 10,000 times less data and only 1 day of training on 4 GPUs, compared to the 256 GPUs used by the recent state of the art.
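The pipeline described in the abstract reduces to a few mechanical steps that a short sketch can make concrete. Below is a minimal, illustrative rendering in Python/NumPy, assuming SMPL fits, a rendered depth map, and an off-the-shelf optical flow field are already available; the z-buffer visibility test stands in for the paper's ray-casting, and every function name and threshold here is an assumption rather than the authors' code.

```python
import numpy as np

def project_vertices(verts_cam, K):
    """Pinhole projection of SMPL vertices (N, 3), given in camera
    coordinates, onto the image plane. Returns (N, 2) pixels and depths."""
    uvw = (K @ verts_cam.T).T            # (N, 3) homogeneous image coords
    depth = uvw[:, 2]
    return uvw[:, :2] / depth[:, None], depth

def visible_by_zbuffer(pix, depth, depth_map, tol=0.05):
    """Stand-in for the paper's ray-casting occlusion test: a vertex counts
    as visible if its depth roughly matches the rendered mesh depth at its
    pixel (within a relative tolerance)."""
    h, w = depth_map.shape
    u = np.clip(pix[:, 0].astype(int), 0, w - 1)
    v = np.clip(pix[:, 1].astype(int), 0, h - 1)
    return np.abs(depth_map[v, u] - depth) < tol * depth

def flow_consistent(pix_t, pix_t1, flow_t, thresh=2.0):
    """Keep tracks whose frame-to-frame displacement agrees with an
    off-the-shelf optical flow field (H, W, 2) within `thresh` pixels."""
    h, w, _ = flow_t.shape
    u = np.clip(pix_t[:, 0].astype(int), 0, w - 1)
    v = np.clip(pix_t[:, 1].astype(int), 0, h - 1)
    predicted = pix_t + flow_t[v, u]
    return np.linalg.norm(predicted - pix_t1, axis=1) < thresh
```

In the actual pipeline, tracks failing either test would be marked occluded or discarded before training.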
Related papers
- SpatialTrackerV2: 3D Point Tracking Made Easy [73.0350898700048]
SpatialTrackerV2 is a feed-forward 3D point tracking method for monocular videos.
It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion.
By learning geometry and motion jointly from heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30%.
arXiv Detail & Related papers (2025-07-16T17:59:03Z)
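As a rough illustration of the decomposition named above, a world-space track point can be composed from depth (scene geometry), camera pose (ego-motion), and a per-pixel motion residual. This is a sketch of the general idea, not SpatialTrackerV2's actual parameterization.

```python
import numpy as np

def backproject(u, v, depth, K):
    """Lift a pixel (u, v) with known depth to a 3D point in camera coordinates."""
    return depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))

def world_track_point(u, v, depth, K, R_wc, t_wc, obj_motion):
    """Compose the three factors: scene geometry (depth), camera ego-motion
    (R_wc, t_wc: camera-to-world pose), and a per-pixel object-motion
    residual, yielding one world-space 3D track point."""
    p_cam = backproject(u, v, depth, K)
    return R_wc @ p_cam + t_wc + obj_motion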
- UAVTwin: Neural Digital Twins for UAVs using Gaussian Splatting [57.63613048492219]
We present UAVTwin, a method for creating digital twins from real-world environments and facilitating data augmentation for training downstream models embedded in unmanned aerial vehicles (UAVs).
This is achieved by integrating 3D Gaussian Splatting (3DGS) for background reconstruction with controllable synthetic human models that display diverse appearances and actions in multiple poses.
arXiv Detail & Related papers (2025-04-02T22:17:30Z)
- 3D Multi-Object Tracking with Semi-Supervised GRU-Kalman Filter [6.13623925528906]
3D Multi-Object Tracking (MOT) is essential for intelligent systems like autonomous driving and robotic sensing.
We propose a GRU-based MOT method, which introduces a learnable Kalman filter into the motion module.
This approach learns object motion characteristics directly from data, avoiding manual model design and the associated model error.
arXiv Detail & Related papers (2024-11-13T08:34:07Z)
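A minimal sketch of what a "learnable Kalman filter" can look like, assuming a GRU that predicts the process-noise covariance from the current innovation; the paper's actual module and state layout may differ.

```python
import torch
import torch.nn as nn

class GRUKalman(nn.Module):
    """Illustrative learnable Kalman filter: a GRU maps the current
    innovation to per-dimension process noise, replacing the hand-tuned Q
    of a classical filter. Sketch only, not the paper's module."""
    def __init__(self, state_dim=4, meas_dim=2, hidden=32):
        super().__init__()
        self.gru = nn.GRUCell(meas_dim, hidden)
        self.noise_head = nn.Linear(hidden, state_dim)  # predicts log-variances

    def step(self, x, P, z, F, H, R, h):
        # Data-driven process noise from the innovation z - H x.
        h = self.gru((z - H @ x).unsqueeze(0), h)
        Q = torch.diag(torch.exp(self.noise_head(h).squeeze(0)))
        # Classical predict/update equations with the learned Q.
        x_pred = F @ x
        P_pred = F @ P @ F.T + Q
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ torch.linalg.inv(S)
        x_new = x_pred + K @ (z - H @ x_pred)
        P_new = (torch.eye(x_new.numel()) - K @ H) @ P_pred
        return x_new, P_new, h
```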
- TAPVid-3D: A Benchmark for Tracking Any Point in 3D [63.060421798990845]
We introduce a new benchmark, TAPVid-3D, for evaluating the task of Tracking Any Point in 3D.
This benchmark will serve as a guidepost to improve our ability to understand precise 3D motion and surface deformation from monocular video.
arXiv Detail & Related papers (2024-07-08T13:28:47Z)
- 3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow [15.479024531161476]
We propose a novel face tracker, FlowFace, that introduces an innovative 2D alignment network for dense per-vertex alignment.
Unlike prior work, FlowFace is trained on high-quality 3D scan annotations rather than weak supervision or synthetic data.
Our method exhibits superior performance on both custom and publicly available benchmarks.
arXiv Detail & Related papers (2024-04-15T14:20:07Z)
- NICP: Neural ICP for 3D Human Registration at Scale [35.631505786332454]
We propose a scalable neural registration method, NSR, for 3D human registration.
NSR generalizes and scales across thousands of shapes and more than ten different data sources.
Our essential contribution is NICP, an ICP-style self-supervised task tailored to neural fields.
arXiv Detail & Related papers (2023-12-21T16:54:09Z)
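For context, classical ICP alternates nearest-neighbor matching with a closed-form rigid fit; NICP recasts this style of objective as a self-supervised task on neural fields. The sketch below shows one vanilla ICP iteration, not the paper's neural variant.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_step(src, dst):
    """One classical ICP iteration on point clouds (N, 3) and (M, 3):
    nearest-neighbor matching followed by a closed-form rigid fit (Kabsch).
    Returns the transformed source plus the rigid transform (R, t)."""
    idx = cKDTree(dst).query(src)[1]          # closest point in dst for each src
    matched = dst[idx]
    mu_s, mu_d = src.mean(0), matched.mean(0)
    H = (src - mu_s).T @ (matched - mu_d)     # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return src @ R.T + t, R, t
```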
- Reducing Training Demands for 3D Gait Recognition with Deep Koopman Operator Constraints [8.382355998881879]
We introduce a new Linear Dynamical Systems (LDS) module and loss based on Koopman operator theory, which provides an unsupervised motion regularization for the periodic nature of gait.
We also show that our 3D modeling approach substantially outperforms other 3D gait approaches in overcoming viewpoint variation under normal, bag-carrying, and clothing-change conditions.
arXiv Detail & Related papers (2023-08-14T21:39:33Z)
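The Koopman idea here is that a good embedding of periodic gait should evolve approximately linearly in time, so a linear operator fit to consecutive embeddings yields an unsupervised regularizer. A rough NumPy sketch, with a DMD-style least-squares fit standing in for the paper's learned LDS module:

```python
import numpy as np

def koopman_residual(Z):
    """Fit a single linear operator A so that z_{t+1} ~ A z_t over a gait
    sequence of embeddings Z (T, d), DMD-style, and return the mean squared
    residual as an unsupervised regularizer for periodic motion."""
    Z_prev, Z_next = Z[:-1], Z[1:]
    A, *_ = np.linalg.lstsq(Z_prev, Z_next, rcond=None)   # least-squares fit
    return np.mean((Z_next - Z_prev @ A) ** 2)
```

In training, this residual would be computed with a differentiable solve and added to the task loss.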
- 3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking [15.330384668966806]
State-of-the-art 3D multi-object tracking (MOT) approaches typically rely on non-learned, model-based algorithms such as the Kalman filter.
We propose 3DMOTFormer, a learned geometry-based 3D MOT framework building upon the transformer architecture.
Our approach achieves 71.2% and 68.2% AMOTA on the nuScenes validation and test split, respectively.
arXiv Detail & Related papers (2023-08-12T19:19:58Z)
- CROMOSim: A Deep Learning-based Cross-modality Inertial Measurement Simulator [7.50015216403068]
Inertial measurement unit (IMU) data has been utilized in monitoring and assessment of human mobility.
To mitigate the data scarcity problem, we design CROMOSim, a cross-modality sensor simulator.
It simulates high-fidelity virtual IMU sensor data from motion capture systems or monocular RGB cameras.
arXiv Detail & Related papers (2022-02-21T22:30:43Z)
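The signal model behind IMU simulation is standard: an accelerometer senses body-frame specific force and a gyroscope senses body-frame angular velocity, both derivable from a pose trajectory. A kinematics-only sketch under that assumption follows; CROMOSim's learned, cross-modality simulator is considerably more involved.

```python
import numpy as np

def simulate_imu(positions, rotations, dt, g=np.array([0.0, 0.0, -9.81])):
    """Synthesize IMU readings from a motion-capture trajectory.
    positions: (T, 3) sensor positions; rotations: (T, 3, 3) body-to-world."""
    # Accelerometer: specific force = R^T (world acceleration - gravity).
    acc_world = np.gradient(np.gradient(positions, dt, axis=0), dt, axis=0)
    accel = np.einsum('tij,tj->ti', rotations.transpose(0, 2, 1),
                      acc_world - g)
    # Gyroscope: R_{t+1} ~ R_t expm([w]_x dt), so [w]_x ~ (R_t^T R_{t+1} - I) / dt.
    dR = np.einsum('tij,tik->tjk', rotations[:-1], rotations[1:])
    wx = (dR - np.eye(3)) / dt
    gyro = np.stack([wx[:, 2, 1], wx[:, 0, 2], wx[:, 1, 0]], axis=1)
    return accel, gyro  # note: gyro has T-1 samples from finite differencing
```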
- Monocular Quasi-Dense 3D Object Tracking [99.51683944057191]
A reliable and accurate 3D tracking framework is essential for predicting future locations of surrounding objects and planning the observer's actions in numerous applications such as autonomous driving.
We propose a framework that can effectively associate moving objects over time and estimate their full 3D bounding box information from a sequence of 2D images captured on a moving platform.
arXiv Detail & Related papers (2021-03-12T15:30:02Z)
- Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation [107.07047303858664]
Large-scale human datasets with 3D ground-truth annotations are difficult to obtain in the wild.
We address this problem by augmenting existing 2D datasets with high-quality 3D pose fits.
The resulting annotations are sufficient to train 3D pose regressor networks from scratch that outperform the current state of the art on in-the-wild benchmarks.
arXiv Detail & Related papers (2020-04-07T20:21:18Z)
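Exemplar fine-tuning, as the title suggests, briefly overfits a pretrained 3D regressor to each image's 2D keypoints and keeps the resulting 3D output as a pseudo-annotation. A hypothetical sketch of that loop, where all names and the simple pinhole reprojection loss are assumptions rather than the paper's exact recipe:

```python
import copy
import torch

def exemplar_fit(regressor, image, kp2d, K, steps=50, lr=1e-5):
    """Briefly overfit a copy of a pretrained 3D pose regressor to one
    image's 2D keypoints, then keep its 3D output as a pseudo-annotation."""
    reg = copy.deepcopy(regressor)            # leave the original weights intact
    opt = torch.optim.Adam(reg.parameters(), lr=lr)
    for _ in range(steps):
        joints3d = reg(image)                 # (J, 3) joints in camera coords
        proj = joints3d @ K.T                 # project with intrinsics K (3, 3)
        proj2d = proj[:, :2] / proj[:, 2:3]
        loss = ((proj2d - kp2d) ** 2).mean()  # 2D reprojection loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return reg(image).detach()                # 3D pseudo-label for this image
```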
This list is automatically generated from the titles and abstracts of the papers on this site.