Rescaling Egocentric Vision
- URL: http://arxiv.org/abs/2006.13256v4
- Date: Fri, 17 Sep 2021 17:17:48 GMT
- Title: Rescaling Egocentric Vision
- Authors: Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari,
Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett,
Will Price, Michael Wray
- Abstract summary: This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS.
The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, 90K actions in 700 variable-length videos.
Compared to its previous version, EPIC-KITCHENS-100 has been annotated using a novel pipeline that allows denser (54% more actions per minute) and more complete annotations of fine-grained actions (+128% more action segments).
- Score: 48.57283024015145
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces the pipeline to extend the largest dataset in
egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a
collection of 100 hours, 20M frames, 90K actions in 700 variable-length videos,
capturing long-term unscripted activities in 45 environments, using
head-mounted cameras. Compared to its previous version, EPIC-KITCHENS-100 has
been annotated using a novel pipeline that allows denser (54% more actions per
minute) and more complete annotations of fine-grained actions (+128% more
action segments). This collection enables new challenges such as action
detection and evaluating the "test of time" - i.e. whether models trained on
data collected in 2018 can generalise to new footage collected two years later.
The dataset is aligned with 6 challenges: action recognition (full and weak
supervision), action detection, action anticipation, cross-modal retrieval
(from captions), as well as unsupervised domain adaptation for action
recognition. For each challenge, we define the task, provide baselines and
evaluation metrics.
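Several of these challenges (action detection, and the moment-query results cited in the related papers below) are scored with segment-level metrics based on temporal IoU. The following minimal Python sketch is a rough, unofficial illustration of how such a metric works, not the dataset's official evaluation code: it computes temporal IoU between predicted and ground-truth action segments and counts matches at an assumed 0.5 threshold. The segment tuple layout, class names, and confidence scores are illustrative assumptions.

```python
# Minimal sketch (not the official EPIC-KITCHENS-100 evaluation code) of how
# detected action segments can be scored against ground truth with a temporal
# IoU threshold. Segment format (start, end, class[, score]) and the 0.5
# threshold are illustrative assumptions.

def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def match_detections(preds, gts, iou_thresh=0.5):
    """Greedily match predictions (highest confidence first) to unmatched
    ground-truth segments of the same class; returns (true_pos, false_pos)."""
    used = set()
    tp, fp = 0, 0
    for start, end, cls, score in sorted(preds, key=lambda p: -p[3]):
        best_iou, best_idx = 0.0, None
        for i, (g_start, g_end, g_cls) in enumerate(gts):
            if i in used or g_cls != cls:
                continue
            iou = temporal_iou((start, end), (g_start, g_end))
            if iou > best_iou:
                best_iou, best_idx = iou, i
        if best_iou >= iou_thresh:
            tp += 1
            used.add(best_idx)
        else:
            fp += 1
    return tp, fp

# Toy example: one correct detection and one spurious one.
gts = [(2.0, 5.5, "cut onion")]
preds = [(2.1, 5.0, "cut onion", 0.9), (10.0, 12.0, "wash pan", 0.4)]
print(match_detections(preds, gts))  # (1, 1)
```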
Related papers
- AIM 2024 Challenge on Video Saliency Prediction: Methods and Results [105.09572982350532]
This paper reviews the Challenge on Video Saliency Prediction at AIM 2024.
The goal of the participants was to develop a method for predicting accurate saliency maps for the provided set of video sequences.
arXiv Detail & Related papers (2024-09-23T08:59:22Z)
- Harnessing Temporal Causality for Advanced Temporal Action Detection [53.654457142657236]
We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on temporal action detection benchmarks.
We ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, and 1st in the Moment Queries track at the Ego4D Challenge 2024.
arXiv Detail & Related papers (2024-07-25T06:03:02Z)
- DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition [51.96660522869841]
DailyDVS-200 is a benchmark dataset tailored for the event-based action recognition community.
It covers 200 action categories across real-world scenarios, recorded by 47 participants, and comprises more than 22,000 event sequences.
DailyDVS-200 is annotated with 14 attributes, ensuring a detailed characterization of the recorded actions.
arXiv Detail & Related papers (2024-07-06T15:25:10Z)
- ReLER@ZJU Submission to the Ego4D Moment Queries Challenge 2022 [42.02602065259257]
We present the ReLER@ZJU submission to the Ego4D Moment Queries Challenge at ECCV 2022.
The goal is to retrieve and localize all instances of possible activities in egocentric videos.
The final submission achieved a Recall@1 score of 37.24 at tIoU=0.5 and an average mAP of 17.67, taking 3rd place on the leaderboard.
arXiv Detail & Related papers (2022-11-17T14:28:31Z)
- NVIDIA-UNIBZ Submission for EPIC-KITCHENS-100 Action Anticipation Challenge 2022 [13.603712913129506]
We describe the technical details of our submission to the EPIC-KITCHENS-100 action anticipation challenge.
Our models, a higher-order recurrent space-time transformer and a message-passing neural network with edge learning, are both recurrent architectures that observe only 2.5 seconds of inference context to form the action anticipation prediction.
By averaging the prediction scores from a set of models trained with our proposed pipeline, we achieved strong performance on the test set: 19.61% overall mean top-5 recall, placing second on the public leaderboard (see the metric sketch after this list).
arXiv Detail & Related papers (2022-06-22T06:34:58Z)
- Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities [29.05606394634704]
Assembly101 is a new procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 "take-apart" toy vehicles.
Participants work without fixed instructions, and the sequences feature rich and natural variations in action ordering, mistakes, and corrections.
Sequences are annotated with more than 100K coarse and 1M fine-grained action segments, and 18M 3D hand poses.
arXiv Detail & Related papers (2022-03-28T12:59:50Z)
- Woodscape Fisheye Semantic Segmentation for Autonomous Driving -- CVPR 2021 OmniCV Workshop Challenge [2.3469719108972504]
The WoodScape fisheye semantic segmentation challenge for autonomous driving was held as part of the CVPR 2021 Workshop on Omnidirectional Computer Vision.
We provide a summary of the competition, which attracted 71 global teams and a total of 395 submissions.
The top teams recorded significantly improved mean IoU and accuracy scores over the baseline PSPNet with a ResNet-50 backbone.
arXiv Detail & Related papers (2021-07-17T14:32:58Z)
- A Stronger Baseline for Ego-Centric Action Detection [38.934802199184354]
This report analyzes the egocentric video action detection method we used in the 2021 EPIC-KITCHENS-100 competition hosted at the CVPR 2021 Workshop.
The task is to locate the start and end times of each action in long untrimmed videos and to predict its action category.
We adopt a sliding-window strategy to generate proposals, which adapts better to short-duration actions.
arXiv Detail & Related papers (2021-06-13T08:11:31Z)
- Anticipative Video Transformer [105.20878510342551]
Anticipative Video Transformer (AVT) is an end-to-end attention-based video modeling architecture.
We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features.
arXiv Detail & Related papers (2021-06-03T17:57:55Z)
- ZSTAD: Zero-Shot Temporal Activity Detection [107.63759089583382]
We propose a novel task setting called zero-shot temporal activity detection (ZSTAD), where activities that have never been seen in training can still be detected.
We design an end-to-end deep network based on R-C3D as the architecture for this solution.
Experiments on both the THUMOS14 and the Charades datasets show promising performance in terms of detecting unseen activities.
arXiv Detail & Related papers (2020-03-12T02:40:36Z)
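The EPIC-KITCHENS-100 anticipation entry above quotes its result as an overall mean top-5 recall. As an unofficial, minimal sketch (not the challenge's evaluation code), the Python snippet below shows one common variant, a class-mean top-5 recall; the (scores, labels) layout and the toy data are illustrative assumptions.

```python
# Minimal sketch of a class-mean top-5 recall, the style of metric quoted in
# the anticipation entry above. Not the official challenge code; the
# (scores, labels) layout and toy data are assumptions for illustration.
import numpy as np

def mean_topk_recall(scores, labels, k=5):
    """scores: (N, C) prediction scores; labels: (N,) ground-truth class ids.
    Returns recall@k averaged over the classes present in `labels`."""
    topk = np.argsort(scores, axis=1)[:, -k:]      # top-k class ids per sample
    hits = (topk == labels[:, None]).any(axis=1)   # True if label is in the top-k
    recalls = [hits[labels == c].mean() for c in np.unique(labels)]
    return float(np.mean(recalls))

# Toy example: 4 samples, 10 classes.
rng = np.random.default_rng(0)
scores = rng.random((4, 10))
labels = np.array([1, 3, 3, 7])
print(mean_topk_recall(scores, labels, k=5))
```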
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.