EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models
- URL: http://arxiv.org/abs/2506.01608v1
- Date: Mon, 02 Jun 2025 12:46:44 GMT
- Title: EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models
- Authors: Andy Bonnetto, Haozhe Qi, Franklin Leong, Matea Tashkovska, Mahdi Rad, Solaiman Shokur, Friedhelm Hummel, Silvestro Micera, Marc Pollefeys, Alexander Mathis
- Abstract summary: We introduce the EPFL-Smart-Kitchen-30 dataset, collected in a motion capture platform inside a kitchen environment. Nine static RGB-D cameras, inertial measurement units (IMUs) and one head-mounted HoloLens 2 headset were used to capture 3D hand, body, and eye movements. The dataset is a multi-view action dataset with synchronized exocentric, egocentric, depth, IMUs, eye gaze, body and hand kinematics spanning 29.7 hours of 16 subjects cooking four different recipes.
- Score: 68.96292501521827
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding behavior requires datasets that capture humans while carrying out complex tasks. The kitchen is an excellent environment for assessing human motor and cognitive function, as many complex actions are naturally exhibited in kitchens from chopping to cleaning. Here, we introduce the EPFL-Smart-Kitchen-30 dataset, collected in a noninvasive motion capture platform inside a kitchen environment. Nine static RGB-D cameras, inertial measurement units (IMUs) and one head-mounted HoloLens 2 headset were used to capture 3D hand, body, and eye movements. The EPFL-Smart-Kitchen-30 dataset is a multi-view action dataset with synchronized exocentric, egocentric, depth, IMUs, eye gaze, body and hand kinematics spanning 29.7 hours of 16 subjects cooking four different recipes. Action sequences were densely annotated with 33.78 action segments per minute. Leveraging this multi-modal dataset, we propose four benchmarks to advance behavior understanding and modeling through 1) a vision-language benchmark, 2) a semantic text-to-motion generation benchmark, 3) a multi-modal action recognition benchmark, 4) a pose-based action segmentation benchmark. We expect the EPFL-Smart-Kitchen-30 dataset to pave the way for better methods as well as insights to understand the nature of ecologically-valid human behavior. Code and data are available at https://github.com/amathislab/EPFL-Smart-Kitchen
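To make the pose-based action segmentation benchmark concrete, the sketch below shows one common way dense action annotations are evaluated: rasterizing (start, end, action) segments into per-frame labels and scoring frame-wise accuracy. This is a minimal illustration under assumed conventions; the segment format, field names, and metric are hypothetical and do not reflect the dataset's actual annotation schema or official evaluation code.

```python
# Minimal sketch: turn dense action segments into per-frame labels and score
# frame-wise accuracy, as is typical for action segmentation benchmarks.
# The (start_frame, end_frame, action) tuple format is an assumption for
# illustration, not the dataset's actual annotation schema.
from typing import List, Tuple


def segments_to_frame_labels(
    segments: List[Tuple[int, int, str]],
    num_frames: int,
    background: str = "background",
) -> List[str]:
    """Expand (start, end, label) segments into one label per frame."""
    labels = [background] * num_frames
    for start, end, action in segments:
        for t in range(max(0, start), min(num_frames, end)):
            labels[t] = action
    return labels


def frame_accuracy(pred: List[str], gt: List[str]) -> float:
    """Fraction of frames where the predicted action matches the ground truth."""
    assert len(pred) == len(gt), "prediction and ground truth must align frame-by-frame"
    correct = sum(p == g for p, g in zip(pred, gt))
    return correct / len(gt)


if __name__ == "__main__":
    # Toy example: 10 frames with two annotated segments.
    gt_segments = [(0, 4, "chop"), (6, 10, "clean")]
    gt = segments_to_frame_labels(gt_segments, num_frames=10)
    pred = ["chop"] * 5 + ["background"] + ["clean"] * 4
    print(f"frame-wise accuracy: {frame_accuracy(pred, gt):.2f}")  # 0.90
```

In practice, segmentation benchmarks typically complement frame-wise accuracy with segment-level metrics (e.g., edit score or segmental F1) to penalize over-segmentation; the sketch only covers the per-frame part.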
Related papers
- HUMOTO: A 4D Dataset of Mocap Human Object Interactions [27.573065832588554]
Human Motions with Objects (HUMOTO) is a high-fidelity dataset of human-object interactions for motion generation, computer vision, and robotics applications. HUMOTO captures interactions with 63 precisely modeled objects and 72 articulated parts. Professional artists rigorously clean and verify each sequence, minimizing foot sliding and object penetrations.
arXiv Detail & Related papers (2025-04-14T16:59:29Z) - HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos [9.513100627302755]
The dataset offers over 833 minutes (3.7M+ images) of recordings that feature 19 subjects interacting with 33 diverse rigid objects. The recordings include multiple synchronized data streams containing egocentric multi-view RGB/monochrome images, eye gaze signal, scene point clouds, and 3D poses of cameras, hands, and objects. In our experiments, we demonstrate the effectiveness of multi-view egocentric data for three popular tasks: 3D hand tracking, model-based 6DoF object pose estimation, and 3D lifting of unknown in-hand objects.
arXiv Detail & Related papers (2024-11-28T14:09:42Z) - Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild [66.34146236875822]
The Nymeria dataset is a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices.
It contains 1200 recordings of 300 hours of daily activities from 264 participants across 50 locations, travelling a total of 399 km.
The motion-language descriptions provide 310.5K sentences in 8.64M words, drawn from a vocabulary of 6,545 words.
arXiv Detail & Related papers (2024-06-14T10:23:53Z) - Introducing HOT3D: An Egocentric Dataset for 3D Hand and Object Tracking [7.443420525809604]
We introduce HOT3D, a dataset for egocentric hand and object tracking in 3D.
The dataset offers over 833 minutes of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects.
In addition to simple pick-up/observe/put-down actions, HOT3D contains scenarios resembling typical actions in a kitchen, office, and living room environment.
arXiv Detail & Related papers (2024-06-13T21:38:17Z) - EPIC Fields: Marrying 3D Geometry and Video Understanding [76.60638761589065]
EPIC Fields is an augmentation of EPIC-KITCHENS with 3D camera information.
It removes the complex and expensive step of reconstructing cameras using photogrammetry.
It reconstructs 96% of videos in EPIC-KITCHENS, registering 19M frames in 99 hours recorded in 45 kitchens.
arXiv Detail & Related papers (2023-06-14T20:33:49Z) - FLAG3D: A 3D Fitness Activity Dataset with Language Instruction [89.60371681477791]
We present FLAG3D, a large-scale 3D fitness activity dataset with language instruction containing 180K sequences of 60 categories.
We show that FLAG3D contributes great research value for various challenges, such as cross-domain human action recognition, dynamic human mesh recovery, and language-guided human action generation.
arXiv Detail & Related papers (2022-12-09T02:33:33Z) - EgoBody: Human Body Shape, Motion and Social Interactions from Head-Mounted Devices [76.50816193153098]
EgoBody is a novel large-scale dataset for social interactions in complex 3D scenes.
We employ Microsoft HoloLens 2 headsets to record rich egocentric data streams including RGB, depth, eye gaze, head and hand tracking.
To obtain accurate 3D ground-truth, we calibrate the headset with a multi-Kinect rig and fit expressive SMPL-X body meshes to multi-view RGB-D frames.
arXiv Detail & Related papers (2021-12-14T18:41:28Z) - The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines [88.47608066382267]
We detail how this large-scale dataset was captured by 32 participants in their native kitchen environments.
Recording took place in 4 countries by participants belonging to 10 different nationalities.
Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes.
arXiv Detail & Related papers (2020-04-29T21:57:04Z)