Waymo-3DSkelMo: A Multi-Agent 3D Skeletal Motion Dataset for Pedestrian Interaction Modeling in Autonomous Driving
- URL: http://arxiv.org/abs/2508.09404v1
- Date: Wed, 13 Aug 2025 00:39:56 GMT
- Title: Waymo-3DSkelMo: A Multi-Agent 3D Skeletal Motion Dataset for Pedestrian Interaction Modeling in Autonomous Driving
- Authors: Guangxun Zhu, Shiyu Fan, Hang Dai, Edmond S. L. Ho
- Abstract summary: Waymo-3DSkelMo is the first large-scale dataset providing high-quality, temporally coherent 3D skeletal motions with explicit interaction semantics. The dataset covers over 14,000 seconds across more than 800 real driving scenarios.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large-scale high-quality 3D motion datasets with multi-person interactions are crucial for data-driven models in autonomous driving to achieve fine-grained pedestrian interaction understanding in dynamic urban environments. However, existing datasets mostly rely on estimating 3D poses from monocular RGB video frames, which suffer from occlusion and lack of temporal continuity, thus resulting in unrealistic and low-quality human motion. In this paper, we introduce Waymo-3DSkelMo, the first large-scale dataset providing high-quality, temporally coherent 3D skeletal motions with explicit interaction semantics, derived from the Waymo Perception dataset. Our key insight is to utilize 3D human body shape and motion priors to enhance the quality of the 3D pose sequences extracted from the raw LiDAR point clouds. The dataset covers over 14,000 seconds across more than 800 real driving scenarios, including rich interactions among an average of 27 agents per scene (with up to 250 agents in the largest scene). Furthermore, we establish 3D pose forecasting benchmarks under varying pedestrian densities, and the results demonstrate its value as a foundational resource for future research on fine-grained human behavior understanding in complex urban environments. The dataset and code will be available at https://github.com/GuangxunZhu/Waymo-3DSkelMo
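To make the forecasting benchmark concrete, here is a minimal sketch of how a multi-agent 3D pose forecasting task is typically evaluated, using MPJPE and a zero-velocity baseline on stand-in data. All array shapes, names, and the metric choice are illustrative assumptions; the repository above defines the actual protocol.

```python
# Minimal sketch of a 3D pose forecasting evaluation in the spirit of the
# benchmark described above. Shapes and names are illustrative assumptions,
# not the dataset's actual API.
import numpy as np

N_AGENTS, T_OBS, T_PRED, N_JOINTS = 27, 10, 20, 15  # assumed sizes

def mpjpe(pred, gt):
    """Mean per-joint position error over all agents, frames, and joints."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Stand-in for one multi-agent scene: (agents, frames, joints, xyz).
rng = np.random.default_rng(0)
history = rng.normal(size=(N_AGENTS, T_OBS, N_JOINTS, 3))
future = rng.normal(size=(N_AGENTS, T_PRED, N_JOINTS, 3))

# Zero-velocity baseline: repeat each agent's last observed skeleton.
zero_vel = np.repeat(history[:, -1:], T_PRED, axis=1)
print(f"zero-velocity MPJPE: {mpjpe(zero_vel, future):.3f}")
```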
Related papers
- D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation [66.7166217399105]
Embodied agents face a critical dilemma: end-to-end models lack interpretability and explicit 3D reasoning.
Our model introduces two key innovations: 1) a Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; 2) a Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive, partially-annotated hybrid data (see the sketch after this entry).
arXiv Detail & Related papers (2025-12-14T09:53:15Z)
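The SLFS strategy above hinges on computing an autoregressive loss only where annotations exist. Below is a minimal, hypothetical sketch of that idea using PyTorch's `ignore_index` mechanism; the shapes, names, and masking convention are assumptions for illustration, not the paper's implementation.

```python
# Sketch: next-token cross-entropy over partially-annotated sequences.
# Positions without labels are marked with -100 and dropped from the loss.
import torch
import torch.nn.functional as F

B, T, V = 2, 8, 100                     # batch, sequence length, vocabulary
logits = torch.randn(B, T, V)           # model predictions for each step
targets = torch.randint(0, V, (B, T))   # hybrid data: some steps unlabeled
targets[0, 3:6] = -100                  # mark missing annotations (assumed)

# ignore_index skips unlabeled positions, so fragmented supervision still
# yields a valid gradient from the annotated steps alone.
loss = F.cross_entropy(logits.view(-1, V), targets.view(-1), ignore_index=-100)
print(loss.item())
```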
- InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation [54.09384502044162]
We introduce InterAct, a large-scale 3D HOI benchmark featuring dataset and methodological advancements.
First, we consolidate and standardize 21.81 hours of HOI data from diverse sources, enriching it with detailed textual annotations.
Second, we propose a unified optimization framework to enhance data quality by reducing artifacts and correcting hand motions.
Third, we define six benchmarking tasks and develop a unified HOI generative modeling perspective, achieving state-of-the-art performance.
arXiv Detail & Related papers (2025-09-11T15:43:54Z)
- Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors [31.277540988829976]
This paper proposes a novel zero-shot HOI synthesis framework that does not rely on end-to-end training on the currently limited 3D HOI datasets.
We employ pre-trained human pose estimation models to extract human poses and introduce a generalizable category-level 6-DoF estimation method to obtain object poses from 2D HOI images.
arXiv Detail & Related papers (2025-03-25T23:55:47Z)
- Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining [49.223455189395025]
Mocap-2-to-3 is a novel framework that performs multi-view lifting from monocular input.
To leverage abundant 2D data, we decompose complex 3D motion into multi-view syntheses.
Our method surpasses state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning.
arXiv Detail & Related papers (2025-03-05T06:32:49Z)
- HMP: Hand Motion Priors for Pose and Shape Estimation from Video [52.39020275278984]
We develop a generative motion prior specific to hands, trained on the AMASS dataset, which features diverse and high-quality hand motions.
Our integration of a robust motion prior significantly enhances performance, especially in occluded scenarios.
We demonstrate our method's efficacy via qualitative and quantitative evaluations on the HO3D and DexYCB datasets.
arXiv Detail & Related papers (2023-12-27T22:35:33Z)
- M3Act: Learning from Synthetic Human Group Activities [18.264989896254523]
M3Act is a synthetic data generator for multi-view multi-group multi-person human atomic actions and group activities.
Powered by the Unity Engine, M3Act features multiple semantic groups and highly diverse, photorealistic images.
M3Act improves the state-of-the-art MOTRv2 on the DanceTrack dataset, leading to a jump from 10th to 2nd place on the leaderboard.
arXiv Detail & Related papers (2023-06-29T08:13:57Z)
- The MI-Motion Dataset and Benchmark for 3D Multi-Person Motion Prediction [13.177817435234449]
3D multi-person motion prediction is a challenging task that involves modeling individual behaviors and interactions between people.
We introduce the Multi-Person Interaction Motion (MI-Motion) dataset, which includes skeleton sequences of multiple individuals collected by motion capture systems.
The dataset contains 167k frames of interacting people's skeleton poses and is categorized into 5 different activity scenes.
arXiv Detail & Related papers (2023-06-23T15:38:22Z)
- HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving [95.42203932627102]
3D human pose estimation is an emerging technology that can enable autonomous vehicles to perceive and understand the subtle and complex behaviors of pedestrians.
Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages (a rough sketch follows this entry).
Our method makes efficient use of these complementary signals in a semi-supervised fashion and outperforms existing methods by a large margin.
arXiv Detail & Related papers (2022-12-15T11:15:14Z)
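The HUM3DIL entry above describes pixel-aligned LiDAR-image fusion followed by Transformer refinement. The sketch below illustrates that general pattern in PyTorch; the projection model, feature sizes, and regression head are assumptions for illustration, not the paper's architecture.

```python
# Sketch of pixel-aligned fusion: project each LiDAR point into the image,
# sample a per-pixel feature, concatenate with the point's coordinates, and
# refine the per-point tokens with Transformer layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, C_IMG, H, W = 2, 256, 32, 64, 96     # assumed batch/point/feature sizes

points = torch.randn(B, N, 3)              # LiDAR points in the camera frame
img_feats = torch.randn(B, C_IMG, H, W)    # RGB backbone feature map

# Toy pinhole projection to normalized [-1, 1] image coords (assumed model).
z = points[..., 2:3].clamp(min=1.0)
grid = (points[..., :2] / z).clamp(-1.0, 1.0).view(B, N, 1, 2)

# Pixel-aligned sampling: one image feature vector per 3D point.
point_img = F.grid_sample(img_feats, grid, align_corners=False)  # (B, C, N, 1)
point_img = point_img.squeeze(-1).transpose(1, 2)                # (B, N, C)

# Fuse geometry and appearance, then refine with Transformer stages.
tokens = torch.cat([points, point_img], dim=-1)                  # (B, N, 3+C)
embed = nn.Linear(3 + C_IMG, 64)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
refined = encoder(embed(tokens))           # (B, N, 64) refined point features
head = nn.Linear(64, 15 * 3)               # e.g. regress 15 3D joints
joints = head(refined.mean(dim=1)).view(B, 15, 3)
print(joints.shape)                        # torch.Size([2, 15, 3])
```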
- Large Scale Interactive Motion Forecasting for Autonomous Driving: The Waymo Open Motion Dataset [84.3946567650148]
With over 100,000 scenes, each 20 seconds long at 10 Hz, our new dataset contains more than 570 hours of unique data over 1750 km of roadways.
We use a high-accuracy 3D auto-labeling system to generate high quality 3D bounding boxes for each road agent.
We introduce a new set of metrics that provides a comprehensive evaluation of both single-agent and joint-agent interaction motion forecasting models (a common displacement metric is sketched after this entry).
arXiv Detail & Related papers (2021-04-20T17:19:05Z)
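Motion forecasting benchmarks like the Waymo Open Motion Dataset typically score multi-modal predictions with displacement metrics such as minADE, alongside minFDE, miss rate, and mAP. The snippet below sketches minADE on synthetic trajectories; it is illustrative, not the dataset's official evaluation code.

```python
# Sketch of minADE: the average displacement error of the best of K
# candidate trajectories against the ground-truth future path.
import numpy as np

def min_ade(pred, gt):
    """pred: (K, T, 2) candidate trajectories; gt: (T, 2) ground truth."""
    err = np.linalg.norm(pred - gt[None], axis=-1)  # (K, T) per-step errors
    return err.mean(axis=1).min()                   # best candidate's mean

rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(size=(80, 2)), axis=0)          # 8 s at 10 Hz
pred = gt[None] + rng.normal(scale=0.5, size=(6, 80, 2))  # K=6 hypotheses
print(f"minADE: {min_ade(pred, gt):.3f} m")
```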
- Monocular Quasi-Dense 3D Object Tracking [99.51683944057191]
A reliable and accurate 3D tracking framework is essential for predicting future locations of surrounding objects and planning the observer's actions in numerous applications such as autonomous driving.
We propose a framework that can effectively associate moving objects over time and estimate their full 3D bounding box information from a sequence of 2D images captured on a moving platform.
arXiv Detail & Related papers (2021-03-12T15:30:02Z)