The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines
- URL: http://arxiv.org/abs/2005.00343v1
- Date: Wed, 29 Apr 2020 21:57:04 GMT
- Title: The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines
- Authors: Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler,
Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby
Perrett, Will Price, Michael Wray
- Abstract summary: We detail how this large-scale dataset was captured by 32 participants in their native kitchen environments.
Recording took place in 4 countries by participants belonging to 10 different nationalities.
Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes.
- Score: 88.47608066382267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the
largest egocentric video benchmark, offering a unique viewpoint on people's
interaction with objects, their attention, and even intention. In this paper,
we detail how this large-scale dataset was captured by 32 participants in their
native kitchen environments, and densely annotated with actions and object
interactions. Our videos depict non-scripted daily activities, as recording was
started every time a participant entered their kitchen. Recording took place in
4 countries, with participants belonging to 10 different nationalities, resulting
in highly diverse kitchen habits and cooking styles. Our dataset features 55
hours of video consisting of 11.5M frames, which we densely labelled for a
total of 39.6K action segments and 454.2K object bounding boxes. Our annotation
is unique in that we had the participants narrate their own videos after
recording, thus reflecting true intention, and we crowd-sourced ground-truths
based on these. We describe our object, action and anticipation challenges,
and evaluate several baselines over two test splits, seen and unseen kitchens.
We introduce new baselines that highlight the multimodal nature of the dataset
and the importance of explicit temporal modelling to discriminate fine-grained
actions, e.g. 'closing a tap' from 'opening' it up.
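The abstract describes verb-noun action segments with temporal extents, evaluated separately on seen and unseen kitchens. As a rough illustration only, the sketch below shows one plausible in-memory representation of such a segment and a per-split top-1 accuracy computation; the field names, the `ActionSegment` dataclass, and the `top1_accuracy_by_split` helper are illustrative assumptions, not the dataset's official annotation schema or evaluation code.

```python
# Minimal sketch (not the official EPIC-KITCHENS tooling): representing
# narration-derived action segments and scoring a baseline separately on
# seen vs. unseen kitchens. All field names and identifiers are assumptions.
from dataclasses import dataclass


@dataclass
class ActionSegment:
    participant_id: str  # e.g. "P01" (hypothetical identifier scheme)
    video_id: str        # e.g. "P01_01"
    start_sec: float     # segment start time, in seconds
    stop_sec: float      # segment stop time, in seconds
    verb: str            # e.g. "close"
    noun: str            # e.g. "tap"


def top1_accuracy_by_split(segments, predictions, seen_participants):
    """Score (verb, noun) predictions separately for seen and unseen kitchens.

    `predictions` maps a segment index to a predicted (verb, noun) pair;
    `seen_participants` is the set of participant ids present in training.
    """
    totals = {"seen": 0, "unseen": 0}
    correct = {"seen": 0, "unseen": 0}
    for idx, seg in enumerate(segments):
        split = "seen" if seg.participant_id in seen_participants else "unseen"
        totals[split] += 1
        if predictions.get(idx) == (seg.verb, seg.noun):
            correct[split] += 1
    return {s: correct[s] / totals[s] if totals[s] else 0.0 for s in totals}


# Toy usage with the fine-grained pair the abstract highlights
# ('closing a tap' vs. 'opening' it).
segments = [
    ActionSegment("P01", "P01_01", 10.2, 12.8, "close", "tap"),
    ActionSegment("P32", "P32_05", 3.0, 4.1, "open", "tap"),
]
preds = {0: ("close", "tap"), 1: ("close", "tap")}
print(top1_accuracy_by_split(segments, preds, seen_participants={"P01"}))
# -> {'seen': 1.0, 'unseen': 0.0}
```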
Related papers
- Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild [66.34146236875822]
The Nymeria dataset is a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices.
It contains 1200 recordings of 300 hours of daily activities from 264 participants across 50 locations, travelling a total of 399Km.
The motion-language descriptions provide 310.5K sentences in 8.64M words from a vocabulary size of 6545.
arXiv Detail & Related papers (2024-06-14T10:23:53Z)
- Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives [194.06650316685798]
Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities.
740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts.
Video accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions.
arXiv Detail & Related papers (2023-11-30T05:21:07Z)
- EPIC Fields: Marrying 3D Geometry and Video Understanding [76.60638761589065]
EPIC Fields is an augmentation of EPIC-KITCHENS with 3D camera information.
It removes the complex and expensive step of reconstructing cameras using photogrammetry.
It reconstructs 96% of videos in EPIC-KITCHENS, registering 19M frames in 99 hours recorded in 45 kitchens.
arXiv Detail & Related papers (2023-06-14T20:33:49Z)
- EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations [83.26326325568208]
We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video.
Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions.
VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality.
arXiv Detail & Related papers (2022-09-26T23:03:26Z)
- Ego4D: Around the World in 3,000 Hours of Egocentric Video [276.1326075259486]
Ego4D is a massive-scale egocentric video dataset and benchmark suite.
It offers 3,025 hours of daily-life activity video spanning hundreds of scenarios captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries.
Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event.
arXiv Detail & Related papers (2021-10-13T22:19:32Z)
- MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions [39.27858380391081]
This paper aims to present a new multi-person dataset of spatio-temporally localized sports actions, coined as MultiSports.
We build the dataset of MultiSports v1.0 by selecting 4 sports classes, collecting around 3200 video clips, and annotating around 37790 action instances with 907k bounding boxes.
arXiv Detail & Related papers (2021-05-16T10:40:30Z)