Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild
- URL: http://arxiv.org/abs/2406.09905v1
- Date: Fri, 14 Jun 2024 10:23:53 GMT
- Title: Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild
- Authors: Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, Kevin Bailey, David Soriano Fosas, C. Karen Liu, Ziwei Liu, Jakob Engel, Renzo De Nardi, Richard Newcombe
- Abstract summary: The Nymeria dataset is a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices.
It contains 1200 recordings of 300 hours of daily activities from 264 participants across 50 locations, travelling a total of 399 km.
The motion-language descriptions provide 310.5K sentences in 8.64M words, drawn from a vocabulary of 6,545 words.
- Score: 66.34146236875822
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce Nymeria - a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices. The dataset comes with a) full-body 3D motion ground truth; b) egocentric multimodal recordings from Project Aria devices with RGB, grayscale, and eye-tracking cameras, IMUs, a magnetometer, a barometer, and microphones; and c) an additional "observer" device providing a third-person viewpoint. We compute world-aligned 6DoF transformations for all sensors, across devices and capture sessions. The dataset also provides 3D scene point clouds and calibrated gaze estimation. We derive a protocol to annotate hierarchical language descriptions of in-context human motion, from fine-grained pose narrations to atomic actions and activity summarization. To the best of our knowledge, the Nymeria dataset is the world's largest in-the-wild collection of human motion with natural and diverse activities; the first of its kind to provide synchronized and localized multi-device multimodal egocentric data; and the world's largest dataset with motion-language descriptions. It contains 1200 recordings of 300 hours of daily activities from 264 participants across 50 locations, travelling a total of 399 km. The motion-language descriptions provide 310.5K sentences in 8.64M words, drawn from a vocabulary of 6,545 words. To demonstrate the potential of the dataset, we define key research tasks for egocentric body tracking, motion synthesis, and action recognition, and evaluate several state-of-the-art baseline algorithms. Data and code will be open-sourced.
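The world-aligned 6DoF transformations mentioned in the abstract allow measurements from any sensor to be expressed in a single world frame by composing the device's world pose with the sensor's calibrated extrinsic. Below is a minimal numpy sketch of that composition, assuming 4x4 homogeneous matrices; the function names and data layout are illustrative assumptions, not the dataset's actual API.

```python
import numpy as np

def make_se3(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 SE(3) matrix."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def sensor_to_world(T_world_device: np.ndarray, T_device_sensor: np.ndarray,
                    points_sensor: np.ndarray) -> np.ndarray:
    """Map Nx3 points from a sensor frame into the shared world frame.

    T_world_device: 4x4 device pose in the world frame (per timestamp).
    T_device_sensor: 4x4 sensor extrinsic in the device frame (from calibration).
    """
    T_world_sensor = T_world_device @ T_device_sensor  # compose the two poses
    homogeneous = np.hstack([points_sensor, np.ones((len(points_sensor), 1))])
    return (T_world_sensor @ homogeneous.T).T[:, :3]

# Illustrative usage: identity device pose, sensor offset 5 cm along x.
T_wd = np.eye(4)
T_ds = make_se3(np.eye(3), np.array([0.05, 0.0, 0.0]))
print(sensor_to_world(T_wd, T_ds, np.array([[0.0, 0.0, 1.0]])))
```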
Related papers
- MM-Conv: A Multi-modal Conversational Dataset for Virtual Humans [4.098892268127572]
We present a novel dataset captured using a VR headset to record conversations between participants within a physics simulator (AI2-THOR).
Our primary objective is to extend the field of co-speech gesture generation by incorporating rich contextual information within referential settings.
arXiv Detail & Related papers (2024-09-30T21:51:30Z)
- MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World [55.878173953175356]
We propose MultiPLY, a multisensory embodied large language model.
We first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k samples.
We demonstrate that MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks.
arXiv Detail & Related papers (2024-01-16T18:59:45Z)
- Aria-NeRF: Multimodal Egocentric View Synthesis [17.0554791846124]
We seek to accelerate research in developing rich, multimodal scene models trained from egocentric data, based on differentiable volumetric ray-tracing inspired by Neural Radiance Fields (NeRFs).
This dataset offers a comprehensive collection of sensory data, featuring RGB images, eye-tracking camera footage, audio recordings from a microphone, atmospheric pressure readings from a barometer, positional coordinates from GPS, and dual-frequency IMU streams (1 kHz and 800 Hz); see the resampling sketch after this entry.
The diverse data modalities and the real-world context captured within this dataset serve as a robust foundation for furthering our understanding of human behavior and for enabling more immersive and intelligent experiences.
arXiv Detail & Related papers (2023-11-11T01:56:35Z)
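Because the two IMU streams above run at different rates (1 kHz and 800 Hz), a common preprocessing step is to resample one stream onto the other's timestamps. The sketch below shows per-channel linear interpolation with numpy, assuming monotonically increasing timestamps; it is not code from the dataset release.

```python
import numpy as np

def resample_imu(src_t: np.ndarray, src_vals: np.ndarray,
                 dst_t: np.ndarray) -> np.ndarray:
    """Linearly interpolate an IMU channel stack (N x C) onto target timestamps.

    src_t: monotonically increasing source timestamps in seconds.
    src_vals: N x C array, e.g. 3 accel + 3 gyro channels.
    dst_t: target timestamps; values outside src_t are clamped to the edges.
    """
    return np.stack([np.interp(dst_t, src_t, src_vals[:, c])
                     for c in range(src_vals.shape[1])], axis=1)

# Illustrative usage: align an 800 Hz stream onto a 1 kHz clock for 1 second.
t_800 = np.arange(0.0, 1.0, 1.0 / 800.0)
t_1000 = np.arange(0.0, 1.0, 1.0 / 1000.0)
vals_800 = np.random.randn(len(t_800), 6)  # placeholder 6-channel IMU samples
aligned = resample_imu(t_800, vals_800, t_1000)
print(aligned.shape)  # (1000, 6)
```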
- DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-centric Rendering [126.00165445599764]
We present DNA-Rendering, a large-scale, high-fidelity repository of human performance data for neural actor rendering.
Our dataset contains over 1500 human subjects, 5000 motion sequences, and a total volume of 67.5M frames.
We construct a professional multi-view capture system of 60 synchronized cameras with up to 4096 x 3000 resolution at 15 fps, together with rigorous camera calibration.
arXiv Detail & Related papers (2023-07-19T17:58:03Z)
- CIRCLE: Capture In Rich Contextual Environments [69.97976304918149]
We propose a novel motion acquisition system in which the actor perceives and operates in a highly contextual virtual world.
We present CIRCLE, a dataset containing 10 hours of full-body reaching motion from five subjects across nine scenes.
We use this dataset to train a model that generates human motion conditioned on scene information; see the toy sketch after this entry.
arXiv Detail & Related papers (2023-03-31T09:18:12Z)
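CIRCLE's downstream task, generating human motion conditioned on scene information, can be pictured as a predictor that consumes a window of past poses together with a pooled scene feature. The toy module below illustrates only that interface; the paper's actual architecture differs, and every dimension here is an assumption.

```python
import torch
import torch.nn as nn

class SceneConditionedMotionModel(nn.Module):
    """Toy autoregressive pose predictor conditioned on a scene embedding.

    Predicts the next body pose from a window of past poses plus a fixed-size
    scene feature (e.g. pooled from a point-cloud encoder). Dimensions are
    illustrative placeholders, not values from the CIRCLE paper.
    """
    def __init__(self, pose_dim: int = 63, history: int = 8, scene_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim * history + scene_dim, 256),
            nn.ReLU(),
            nn.Linear(256, pose_dim),  # predicted next pose
        )

    def forward(self, past_poses: torch.Tensor, scene_feat: torch.Tensor) -> torch.Tensor:
        # past_poses: (B, history, pose_dim); scene_feat: (B, scene_dim)
        x = torch.cat([past_poses.flatten(1), scene_feat], dim=-1)
        return self.net(x)

# Illustrative usage with random tensors.
model = SceneConditionedMotionModel()
next_pose = model(torch.randn(2, 8, 63), torch.randn(2, 128))
print(next_pose.shape)  # torch.Size([2, 63])
```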
- FLAG3D: A 3D Fitness Activity Dataset with Language Instruction [89.60371681477791]
We present FLAG3D, a large-scale 3D fitness activity dataset with language instruction, containing 180K sequences across 60 categories.
We show that FLAG3D offers substantial research value for various challenges, such as cross-domain human action recognition, dynamic human mesh recovery, and language-guided human action generation.
arXiv Detail & Related papers (2022-12-09T02:33:33Z)
- The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose [108.21037046507483]
IKEA ASM is a three-million-frame, multi-view furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose.
We benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset.
The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
arXiv Detail & Related papers (2020-07-01T11:34:46Z)