HD-EPIC: A Highly-Detailed Egocentric Video Dataset
- URL: http://arxiv.org/abs/2502.04144v1
- Date: Thu, 06 Feb 2025 15:25:05 GMT
- Title: HD-EPIC: A Highly-Detailed Egocentric Video Dataset
- Authors: Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rhodri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, Dima Damen
- Abstract summary: HD-EPIC is 41 hours of video in 9 kitchens with digital twins of 413 kitchen fixtures, capturing 69 recipes, 59K fine-grained actions, 51K audio events, 20K object movements and 37K object masks lifted to 3D.
On average, we have 263 annotations per minute of our unscripted videos.
- Score: 35.957563351011935
- Abstract: We present a validation dataset of newly-collected kitchen-based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering: recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations. Importantly, all annotations are grounded in 3D through digital twinning of the scene, fixtures, object locations, and primed with gaze. Footage is collected from unscripted recordings in diverse home environments, making HD-EPIC the first dataset collected in-the-wild but with detailed annotations matching those in controlled lab environments. We show the potential of our highly-detailed annotations through a challenging VQA benchmark of 26K questions assessing the capability to recognise recipes, ingredients, nutrition, fine-grained actions, 3D perception, object motion, and gaze direction. The powerful long-context Gemini Pro only achieves 38.5% on this benchmark, showcasing its difficulty and highlighting shortcomings in current VLMs. We additionally assess action recognition, sound recognition, and long-term video-object segmentation on HD-EPIC. HD-EPIC is 41 hours of video in 9 kitchens with digital twins of 413 kitchen fixtures, capturing 69 recipes, 59K fine-grained actions, 51K audio events, 20K object movements and 37K object masks lifted to 3D. On average, we have 263 annotations per minute of our unscripted videos.
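The 38.5% Gemini Pro figure is plain accuracy over the benchmark's questions. A minimal scoring sketch, assuming a multiple-choice format with one correct choice index per question; the field names, categories, and toy data below are illustrative assumptions, not HD-EPIC's actual release format:

```python
# Minimal sketch of multiple-choice VQA scoring, in the style of HD-EPIC's
# 26K-question benchmark. Field names, categories, and the toy data are
# illustrative assumptions, not the dataset's actual release format.
from collections import defaultdict

def accuracy_by_category(examples, predictions):
    """Return overall and per-category accuracy.

    examples:    list of dicts with "id", "category", and "answer"
                 (index of the correct choice)
    predictions: dict mapping question id -> predicted choice index
    """
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["category"]] += 1
        if predictions.get(ex["id"]) == ex["answer"]:
            correct[ex["category"]] += 1
    per_cat = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_cat

# Toy stand-ins: two questions, one answered correctly -> 50% overall.
examples = [
    {"id": "q1", "category": "recipe", "answer": 2},
    {"id": "q2", "category": "gaze", "answer": 0},
]
predictions = {"q1": 2, "q2": 3}  # e.g. parsed from a VLM's text output
overall, per_cat = accuracy_by_category(examples, predictions)
print(f"overall: {overall:.1%}", per_cat)
```

Per-category breakdowns matter here because aggregate accuracy can mask which capabilities (e.g. 3D perception vs. recipe recognition) a VLM actually lacks.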
Related papers
- PACE: A Large-Scale Dataset with Pose Annotations in Cluttered Environments [50.79058028754952]
PACE (Pose Annotations in Cluttered Environments) is a large-scale benchmark for pose estimation methods in cluttered scenarios.
The benchmark consists of 55K frames with 258K annotations across 300 videos, covering 238 objects from 43 categories.
PACE-Sim contains 100K photo-realistic simulated frames with 2.4M annotations across 931 objects.
arXiv Detail & Related papers (2023-12-23T01:38:41Z)
- EPIC Fields: Marrying 3D Geometry and Video Understanding [76.60638761589065]
EPIC Fields is an augmentation of EPIC-KITCHENS with 3D camera information.
It removes the complex and expensive step of reconstructing cameras using photogrammetry (see the projection sketch after this list).
It reconstructs 96% of videos in EPIC-KITCHENS, registering 19M frames in 99 hours recorded across 45 kitchens.
arXiv Detail & Related papers (2023-06-14T20:33:49Z)
- EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations [83.26326325568208]
We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video.
The benchmark requires both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions.
VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality.
arXiv Detail & Related papers (2022-09-26T23:03:26Z)
- Ego4D: Around the World in 3,000 Hours of Egocentric Video [276.1326075259486]
Ego4D is a massive-scale egocentric video dataset and benchmark suite.
It offers 3,025 hours of daily-life activity video spanning hundreds of scenarios captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries.
Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event.
arXiv Detail & Related papers (2021-10-13T22:19:32Z)
- The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines [88.47608066382267]
We detail how this large-scale dataset was captured by 32 participants in their native kitchen environments.
Recording took place in 4 countries, with participants of 10 different nationalities.
Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes.
arXiv Detail & Related papers (2020-04-29T21:57:04Z)
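Several of the papers above, HD-EPIC (object masks lifted to 3D) and EPIC Fields (per-frame camera registration) in particular, hinge on projecting 3D scene points into video frames. A minimal sketch under a standard pinhole camera model; the intrinsics, pose, and point values below are illustrative, not either dataset's actual camera format:

```python
# Minimal sketch, assuming a standard pinhole camera model: project a 3D
# object location (world coordinates) into a 2D pixel using per-frame
# camera poses like those EPIC Fields provides. All numeric values are
# illustrative, not taken from either dataset.
import numpy as np

def project(point_w, R, t, K):
    """Project a world-coordinate 3D point into pixel coordinates.

    point_w: (3,) 3D point in world coordinates
    R, t:    world-to-camera rotation (3x3) and translation (3,)
    K:       camera intrinsics (3x3)
    """
    p_cam = R @ point_w + t          # world frame -> camera frame
    assert p_cam[2] > 0, "point is behind the camera"
    uvw = K @ p_cam                  # camera frame -> homogeneous pixels
    return uvw[:2] / uvw[2]          # perspective divide -> (u, v)

K = np.array([[600.0,   0.0, 320.0],   # fx, skew, cx
              [  0.0, 600.0, 240.0],   # fy, cy
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                          # identity rotation for illustration
t = np.array([0.0, 0.0, 2.0])          # camera 2 m from the world origin
print(project(np.array([0.1, -0.2, 0.5]), R, t, K))  # -> [344. 192.]
```

Given such per-frame poses, 2D annotations (masks, gaze points, object locations) can be lifted to, or rendered from, a 3D digital twin without per-video photogrammetry.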