MAMMA: Markerless & Automatic Multi-Person Motion Action Capture
- URL: http://arxiv.org/abs/2506.13040v2
- Date: Tue, 24 Jun 2025 15:25:06 GMT
- Title: MAMMA: Markerless & Automatic Multi-Person Motion Action Capture
- Authors: Hanz Cuevas-Velasquez, Anastasios Yiannakidis, Soyong Shin, Giorgio Becherini, Markus Höschle, Joachim Tesch, Taylor Obersat, Tsvetelina Alexiadis, Michael J. Black
- Abstract summary: MAMMA is a markerless motion-capture pipeline that recovers SMPL-X parameters from multi-view video of two-person interaction sequences. We introduce a method that predicts dense 2D surface landmarks conditioned on segmentation masks. We demonstrate that our approach can handle complex person--person interaction and offers greater accuracy than existing methods.
- Score: 37.06717786024836
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video of two-person interaction sequences. Traditional motion-capture systems rely on physical markers. Although they offer high accuracy, their requirements of specialized hardware, manual marker placement, and extensive post-processing make them costly and time-consuming. Recent learning-based methods attempt to overcome these limitations, but most are designed for single-person capture, rely on sparse keypoints, or struggle with occlusions and physical interactions. In this work, we introduce a method that predicts dense 2D surface landmarks conditioned on segmentation masks, enabling person-specific correspondence estimation even under heavy occlusion. We employ a novel architecture that exploits learnable queries for each landmark. We demonstrate that our approach can handle complex person--person interaction and offers greater accuracy than existing methods. To train our network, we construct a large, synthetic multi-view dataset combining human motions from diverse sources, including extreme poses, hand motions, and close interactions. Our dataset yields high-variability synthetic sequences with rich body contact and occlusion, and includes SMPL-X ground-truth annotations with dense 2D landmarks. The result is a system capable of capturing human motion without the need for markers. Our approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions, without the extensive manual cleanup. Finally, we address the absence of common benchmarks for dense-landmark prediction and markerless motion capture by introducing two evaluation settings built from real multi-view sequences. We will release our dataset, benchmark, method, training code, and pre-trained model weights for research purposes.
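The "learnable queries for each landmark" described above suggest a detection-transformer-style prediction head. Below is a minimal sketch of that reading, not the authors' released code: each dense surface landmark owns a query embedding that cross-attends to mask-conditioned image features and regresses a normalized 2D location plus a visibility logit. The landmark count, feature shapes, and module names are assumptions.

```python
# Hedged sketch (not MAMMA's actual implementation): per-landmark learnable
# queries decoded against mask-conditioned image features.
import torch
import torch.nn as nn

class LandmarkQueryHead(nn.Module):
    def __init__(self, num_landmarks=512, dim=256, num_layers=4):
        super().__init__()
        self.queries = nn.Embedding(num_landmarks, dim)           # one query per landmark
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, 3)                             # (u, v, visibility)

    def forward(self, image_feats, mask_feats):
        # image_feats, mask_feats: (B, HW, dim) flattened feature maps; the mask
        # features make correspondences person-specific under occlusion.
        memory = image_feats + mask_feats
        q = self.queries.weight.unsqueeze(0).expand(memory.shape[0], -1, -1)
        out = self.decoder(q, memory)                             # (B, num_landmarks, dim)
        pred = self.head(out)
        return pred[..., :2].sigmoid(), pred[..., 2]              # 2D coords, visibility logit
```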
Related papers
- Learning to Track Any Points from Human Motion [55.831218129679144]
We propose an automated pipeline to generate pseudo-labeled training data for point tracking. A point tracking model trained on AnthroTAP achieves state-of-the-art performance on the TAP-Vid benchmark.
arXiv Detail & Related papers (2025-07-08T17:59:58Z) - FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video [52.33896173943054]
Egocentric motion capture with a head-mounted body-facing stereo camera is crucial for VR and AR applications. Existing methods rely on synthetic pretraining and struggle to generate smooth and accurate predictions in real-world settings. We propose FRAME, a simple yet effective architecture that combines device pose and camera feeds for state-of-the-art body pose prediction.
arXiv Detail & Related papers (2025-03-29T14:26:06Z) - SIGHT: Synthesizing Image-Text Conditioned and Geometry-Guided 3D Hand-Object Trajectories [124.24041272390954]
Modeling hand-object interaction priors holds significant potential to advance robotic and embodied AI systems. We introduce SIGHT, a novel task focused on generating realistic and physically plausible 3D hand-object interaction trajectories from a single image. We propose SIGHT-Fusion, a novel diffusion-based, image-text conditioned generative model that tackles this task by retrieving the most similar 3D object mesh from a database.
arXiv Detail & Related papers (2025-03-28T20:53:20Z) - One-shot Human Motion Transfer via Occlusion-Robust Flow Prediction and Neural Texturing [21.613055849276385]
We propose a unified framework that combines multi-scale feature warping and neural texture mapping to recover better 2D appearance and 2.5D geometry. Our model takes advantage of multiple modalities by jointly training and fusing them, which allows it to learn robust neural texture features that cope with geometric errors.
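As an illustration of the feature-warping step mentioned above, here is a generic flow-based warp via `grid_sample`; it is a standard building block rather than this paper's code, and the tensor shapes are assumptions.

```python
# Hedged sketch: warp source features with a predicted 2D flow field.
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """feat: (B, C, H, W) source features; flow: (B, 2, H, W) displacements in pixels."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)   # (2, H, W) pixel coords
    coords = grid.unsqueeze(0) + flow                             # where to sample in the source
    # Normalize to [-1, 1] as grid_sample expects.
    coords_x = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    norm_grid = torch.stack((coords_x, coords_y), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(feat, norm_grid, align_corners=True)
```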
arXiv Detail & Related papers (2024-12-09T03:14:40Z) - Reconstructing Close Human Interactions from Multiple Views [38.924950289788804]
This paper addresses the challenging task of reconstructing the poses of multiple individuals engaged in close interactions, captured by multiple calibrated cameras.
We introduce a novel system to address these challenges.
Our system integrates a learning-based pose estimation component and its corresponding training and inference strategies.
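For context on the multi-view setting, the sketch below shows standard linear (DLT) triangulation of a single joint from calibrated cameras; the paper's learning-based estimator builds on this kind of geometry, and the function here is illustrative rather than taken from it.

```python
# Hedged sketch: DLT triangulation of one joint observed by several calibrated cameras.
import numpy as np

def triangulate_point(projections, points_2d):
    """projections: list of 3x4 camera matrices; points_2d: list of (x, y) observations."""
    A = []
    for P, (x, y) in zip(projections, points_2d):
        A.append(x * P[2] - P[0])
        A.append(y * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]                      # null-space solution in homogeneous coords
    return X[:3] / X[3]             # Euclidean 3D point
```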
arXiv Detail & Related papers (2024-01-29T14:08:02Z) - Markerless 3D human pose tracking through multiple cameras and AI: Enabling high accuracy, robustness, and real-time performance [0.0]
Tracking 3D human motion in real-time is crucial for numerous applications across many fields.
Recent advances in Artificial Intelligence have allowed for markerless solutions.
We propose a markerless framework that combines multi-camera views and 2D AI-based pose estimation methods to track 3D human motion.
arXiv Detail & Related papers (2023-03-31T15:06:50Z) - 3D Human Mesh Estimation from Virtual Markers [34.703241940871635]
We present an intermediate representation, named virtual markers, which learns 64 landmark keypoints on the body surface.
Our approach outperforms the state-of-the-art methods on three datasets.
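One way to read the virtual-marker idea is as a linear intermediate representation: estimate 3D positions for a small set of surface landmarks, then recover the full mesh as a fixed linear combination of them. The sketch below uses a placeholder mixing matrix rather than the learned one from the paper.

```python
# Hedged sketch: mesh recovery from virtual markers with a placeholder weight matrix.
import numpy as np

num_markers, num_vertices = 64, 6890            # 6890 = SMPL vertex count
W = np.random.rand(num_vertices, num_markers)   # stand-in for the learned interpolation matrix
W /= W.sum(axis=1, keepdims=True)               # each vertex as a convex combination of markers

def markers_to_mesh(markers_3d):
    """markers_3d: (64, 3) estimated virtual-marker positions -> (6890, 3) mesh vertices."""
    return W @ markers_3d
```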
arXiv Detail & Related papers (2023-03-21T10:30:43Z) - A Dual-Masked Auto-Encoder for Robust Motion Capture with Spatial-Temporal Skeletal Token Completion [13.88656793940129]
We propose an adaptive, identity-aware triangulation module to reconstruct 3D joints and identify the missing joints for each identity.
We then propose a Dual-Masked Auto-Encoder (D-MAE) which encodes the joint status with both skeletal-structural and temporal position encoding for trajectory completion.
In order to demonstrate the proposed model's capability in dealing with severe data loss scenarios, we contribute a high-accuracy and challenging motion capture dataset.
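A minimal sketch of the general recipe described above, with assumed shapes and names: tokenize per-joint, per-frame positions, add both a skeletal (joint-index) and a temporal (frame-index) embedding, replace missing joints with a mask token, and let a transformer encoder regress completed trajectories.

```python
# Hedged sketch: trajectory completion with skeletal + temporal position encodings.
import torch
import torch.nn as nn

class TrajectoryCompleter(nn.Module):
    def __init__(self, num_joints=17, max_frames=128, dim=128):
        super().__init__()
        self.inp = nn.Linear(3, dim)
        self.joint_pos = nn.Embedding(num_joints, dim)   # skeletal-structural encoding
        self.time_pos = nn.Embedding(max_frames, dim)    # temporal encoding
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(dim, 3)

    def forward(self, joints, missing):
        # joints:  (B, T, J, 3) noisy/incomplete 3D joints
        # missing: (B, T, J) boolean, True where a joint is unobserved
        B, T, J, _ = joints.shape
        x = self.inp(joints)
        x = torch.where(missing.unsqueeze(-1), self.mask_token.expand_as(x), x)
        t_ids = torch.arange(T, device=joints.device)
        j_ids = torch.arange(J, device=joints.device)
        x = x + self.time_pos(t_ids)[None, :, None, :] + self.joint_pos(j_ids)[None, None, :, :]
        x = self.encoder(x.reshape(B, T * J, -1))
        return self.out(x).reshape(B, T, J, 3)           # completed trajectories
```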
arXiv Detail & Related papers (2022-07-15T10:00:43Z) - MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
arXiv Detail & Related papers (2021-12-29T17:23:24Z) - SOMA: Solving Optical Marker-Based MoCap Automatically [56.59083192247637]
We train a novel neural network called SOMA, which takes raw mocap point clouds with varying numbers of points and labels them at scale.
SOMA exploits an architecture with stacked self-attention elements to learn the spatial structure of the 3D body.
We automatically label over 8 hours of archival mocap data across 4 different datasets.
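The sketch below illustrates the core idea as summarized here, not SOMA's released code: embed a variable-length set of unlabeled mocap points, apply stacked self-attention so the points can reason about the body's spatial structure, and classify each point into a marker label (plus a null class for ghost points).

```python
# Hedged sketch: per-point marker labeling of raw mocap point clouds via self-attention.
import torch
import torch.nn as nn

class PointLabeler(nn.Module):
    def __init__(self, num_labels=53, dim=128, num_layers=6):
        super().__init__()
        self.embed = nn.Linear(3, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls = nn.Linear(dim, num_labels + 1)        # +1 for a null/ghost class

    def forward(self, points, padding_mask=None):
        # points: (B, N, 3) raw mocap points, N varies per frame (pad + mask as needed)
        x = self.attn(self.embed(points), src_key_padding_mask=padding_mask)
        return self.cls(x)                               # (B, N, num_labels + 1) logits
```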
arXiv Detail & Related papers (2021-10-09T02:27:27Z) - Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation [61.98690211671168]
We propose a Multi-level Attention-Decoder Network (MAED) to model multi-level attentions in a unified framework.
When trained with the 3DPW training set, MAED outperforms previous state-of-the-art methods by 6.2, 7.2, and 2.4 mm of PA-MPJPE.
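For reference, PA-MPJPE, the metric quoted above, is the mean per-joint position error after a Procrustes alignment (rotation, scale, translation) of the prediction to the ground truth. A standard NumPy implementation, not taken from the paper:

```python
# Hedged sketch: Procrustes-aligned mean per-joint position error (PA-MPJPE).
import numpy as np

def pa_mpjpe(pred, gt):
    """pred, gt: (J, 3) joint positions in meters; returns the error in millimeters."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # avoid an improper rotation (reflection)
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (P ** 2).sum()
    aligned = scale * P @ R.T + mu_g          # similarity-aligned prediction
    return 1000.0 * np.linalg.norm(aligned - gt, axis=1).mean()
```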
arXiv Detail & Related papers (2021-09-06T09:06:17Z) - Recovering Trajectories of Unmarked Joints in 3D Human Actions Using Latent Space Optimization [16.914342116747825]
Motion capture (mocap) and time-of-flight based sensing of human actions are becoming increasingly popular modalities to perform robust activity analysis.
However, both modalities face practical challenges such as visibility, tracking errors, and the need to keep the marker setup convenient.
This paper formulates the reconstruction of unmarked joint data as an ill-posed linear inverse problem.
Experiments on both mocap and Kinect datasets clearly demonstrate that the proposed method performs very well in recovering semantics of the actions and dynamics of missing joints.
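A minimal sketch of the latent-space-optimization idea, with a stand-in `decoder` for whatever learned motion prior is used: optimize a low-dimensional code so the decoded motion matches the observed joints, then read the missing joints from the decoded output.

```python
# Hedged sketch: fill in missing joints by optimizing a latent code of a motion prior.
import torch

def recover_missing(decoder, observed, observed_mask, latent_dim=32, steps=500):
    # observed: (T, J, 3) joints, arbitrary values where observed_mask is False
    # decoder:  callable mapping a latent code to a full (T, J, 3) motion (assumed given)
    z = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        recon = decoder(z)
        loss = ((recon - observed)[observed_mask] ** 2).mean()   # fit observed joints only
        loss.backward()
        opt.step()
    return decoder(z).detach()                                   # includes filled-in joints
```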
arXiv Detail & Related papers (2020-12-03T16:25:07Z)