Ego4D: Around the World in 3,000 Hours of Egocentric Video
- URL: http://arxiv.org/abs/2110.07058v1
- Date: Wed, 13 Oct 2021 22:19:32 GMT
- Title: Ego4D: Around the World in 3,000 Hours of Egocentric Video
- Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis,
Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu,
Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh
Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu,
Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent
Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph
Feichtenhofer, Adriano Fragomeni, Qichen Fu, Christian Fuegen, Abrham
Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei Huang,
Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico
Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava
Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola
Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey
Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu,
Takuma Yagi, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni
Maria Farinella, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul
Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park,
James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba,
Lorenzo Torresani, Mingfei Yan, Jitendra Malik
- Abstract summary: Ego4D is a massive-scale egocentric video dataset and benchmark suite.
It offers 3,025 hours of daily-life activity video spanning hundreds of scenarios captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries.
Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event.
- Score: 276.1326075259486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark
suite. It offers 3,025 hours of daily-life activity video spanning hundreds of
scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique
camera wearers from 74 worldwide locations and 9 different countries. The
approach to collection is designed to uphold rigorous privacy and ethics
standards with consenting participants and robust de-identification procedures
where relevant. Ego4D dramatically expands the volume of diverse egocentric
video footage publicly available to the research community. Portions of the
video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo,
and/or synchronized videos from multiple egocentric cameras at the same event.
Furthermore, we present a host of new benchmark challenges centered around
understanding the first-person visual experience in the past (querying an
episodic memory), present (analyzing hand-object manipulation, audio-visual
conversation, and social interactions), and future (forecasting activities). By
publicly sharing this massive annotated dataset and benchmark suite, we aim to
push the frontier of first-person perception. Project page:
https://ego4d-data.org/
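The abstract notes that only portions of the video carry each additional modality (audio, 3D environment meshes, eye gaze, stereo, synchronized multi-camera capture). The sketch below shows how one might index such a collection and filter it by modality; the `videos.json` manifest and its field names (`video_uid`, `duration_sec`, `modalities`) are hypothetical illustrations for this note, not the actual Ego4D schema documented on the project page.

```python
import json

# Hypothetical manifest format (illustrative only, not the real Ego4D schema):
# [
#   {"video_uid": "abc123", "duration_sec": 1800.0,
#    "modalities": ["audio", "gaze", "stereo"]},
#   ...
# ]

def load_manifest(path):
    """Load a JSON manifest listing videos and their available modalities."""
    with open(path, "r") as f:
        return json.load(f)

def select_videos(records, required_modalities):
    """Keep only the videos that carry every requested modality."""
    required = set(required_modalities)
    return [r for r in records if required.issubset(r["modalities"])]

def total_hours(records):
    """Sum per-video durations (seconds) and report hours."""
    return sum(r["duration_sec"] for r in records) / 3600.0

if __name__ == "__main__":
    records = load_manifest("videos.json")  # hypothetical path
    gaze_stereo = select_videos(records, ["gaze", "stereo"])
    print(f"{len(gaze_stereo)} videos, {total_hours(gaze_stereo):.1f} hours")
```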
Related papers
- Ego3DT: Tracking Every 3D Object in Ego-centric Videos [20.96550148331019]
This paper introduces a novel zero-shot approach for 3D reconstruction and tracking of all objects in ego-centric videos.
We present Ego3DT, a novel framework that initially identifies and extracts detection and segmentation information of objects within the ego environment.
We also introduce a dynamic hierarchical association mechanism for creating stable 3D tracking trajectories of objects in ego-centric videos (a simplified association sketch appears after this list).
arXiv Detail & Related papers (2024-10-11T05:02:31Z)
- AMEGO: Active Memory from long EGOcentric videos [26.04157621755452]
We introduce AMEGO, a novel approach aimed at enhancing the comprehension of very-long egocentric videos.
Inspired by humans' ability to retain information from a single viewing, AMEGO focuses on constructing a self-contained representation from a single egocentric video.
This representation is semantic-free and facilitates multiple queries without the need to reprocess the entire visual content.
arXiv Detail & Related papers (2024-09-17T06:18:47Z)
- EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation [54.32133648259802]
We present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge.
Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo.
This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions.
arXiv Detail & Related papers (2024-06-26T05:01:37Z)
- Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions (a simplified sketch of this objective appears after this list).
arXiv Detail & Related papers (2024-01-01T15:31:06Z)
- Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives [194.06650316685798]
Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities.
740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts.
The video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions.
arXiv Detail & Related papers (2023-11-30T05:21:07Z)
- Exploring adaptation of VideoMAE for Audio-Visual Diarization & Social @ Ego4d Looking at me Challenge [5.429147779652134]
VideoMAE is a data-efficient model for self-supervised video pre-training.
We show that the representation transferred from VideoMAE provides strong spatio-temporal modeling.
arXiv Detail & Related papers (2022-11-17T06:49:57Z)
- EgoEnv: Human-centric environment representations from egocentric video [60.34649902578047]
First-person video highlights a camera-wearer's activities in the context of their persistent environment.
Current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space.
We present an approach that links egocentric video and the environment by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings.
arXiv Detail & Related papers (2022-07-22T22:39:57Z)
- The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines [88.47608066382267]
We detail how this large-scale dataset was captured by 32 participants in their native kitchen environments.
Recording took place in 4 countries, with participants belonging to 10 different nationalities.
Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes.
arXiv Detail & Related papers (2020-04-29T21:57:04Z)
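The Ego3DT entry above describes a dynamic hierarchical association mechanism that links per-frame object detections into stable 3D trajectories. The sketch below greedily matches current-frame 3D centroids to existing tracks by distance; it is a generic simplification of detection-to-track association, not the paper's mechanism, and names such as `greedy_associate` and `max_dist` are hypothetical.

```python
import numpy as np

def greedy_associate(prev_tracks, detections, max_dist=0.5):
    """Link current-frame 3D detections to existing tracks by nearest
    3D centroid. A simplified stand-in for a full association mechanism.

    prev_tracks: dict of track_id -> last 3D centroid, shape (3,)
    detections:  list of 3D centroids (np.ndarray) for the current frame
    Returns a dict of track_id -> centroid; unmatched detections open new
    tracks, and tracks with no match within max_dist are simply dropped.
    """
    assigned = {}
    unmatched = list(range(len(detections)))
    next_id = max(prev_tracks, default=-1) + 1

    for tid, centroid in prev_tracks.items():
        if not unmatched:
            break
        # Pick the closest unmatched detection for this track.
        dists = [np.linalg.norm(detections[i] - centroid) for i in unmatched]
        best = int(np.argmin(dists))
        if dists[best] <= max_dist:
            assigned[tid] = detections[unmatched[best]]
            unmatched.pop(best)

    for i in unmatched:  # leftover detections start new tracks
        assigned[next_id] = detections[i]
        next_id += 1
    return assigned

if __name__ == "__main__":
    tracks = {0: np.array([0.0, 0.0, 1.0]), 1: np.array([1.0, 0.0, 1.0])}
    frame_dets = [np.array([0.05, 0.0, 1.0]), np.array([2.0, 0.5, 1.0])]
    print(greedy_associate(tracks, frame_dets))
```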
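The Retrieval-Augmented Egocentric Video Captioning entry above describes an EgoExoNCE loss that aligns egocentric and exocentric video features to shared text features. Below is a minimal sketch of that idea as a symmetric InfoNCE objective in PyTorch, assuming batched, paired clip and text embeddings; it illustrates the general contrastive alignment, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(features, anchors, temperature=0.07):
    """Standard InfoNCE: the i-th feature should match the i-th anchor."""
    features = F.normalize(features, dim=-1)
    anchors = F.normalize(anchors, dim=-1)
    logits = features @ anchors.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(features.size(0), device=features.device)
    return F.cross_entropy(logits, targets)

def ego_exo_nce(ego_feats, exo_feats, text_feats, temperature=0.07):
    """Simplified EgoExoNCE-style objective: pull paired egocentric and
    exocentric clip features toward the same text feature describing
    the action."""
    loss_ego = info_nce(ego_feats, text_feats, temperature)
    loss_exo = info_nce(exo_feats, text_feats, temperature)
    return 0.5 * (loss_ego + loss_exo)

if __name__ == "__main__":
    B, D = 8, 256            # batch of 8 paired clips, 256-d features
    ego = torch.randn(B, D)  # egocentric clip embeddings (stand-ins)
    exo = torch.randn(B, D)  # exocentric clip embeddings (stand-ins)
    txt = torch.randn(B, D)  # shared text embeddings (stand-ins)
    print(ego_exo_nce(ego, exo, txt).item())
```

Because both views are pulled toward the same text anchor, paired ego and exo clips end up close to each other even though the loss never compares them directly.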