Assembly101: A Large-Scale Multi-View Video Dataset for Understanding
Procedural Activities
- URL: http://arxiv.org/abs/2203.14712v1
- Date: Mon, 28 Mar 2022 12:59:50 GMT
- Title: Assembly101: A Large-Scale Multi-View Video Dataset for Understanding
Procedural Activities
- Authors: Fadime Sener and Dibyadip Chatterjee and Daniel Shelepov and Kun He
and Dipika Singhania and Robert Wang and Angela Yao
- Abstract summary: Assembly101 is a new procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 "take-apart" toy vehicles.
Participants work without fixed instructions, and the sequences feature rich and natural variations in action ordering, mistakes, and corrections.
Sequences are annotated with more than 100K coarse and 1M fine-grained action segments, and 18M 3D hand poses.
- Score: 29.05606394634704
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Assembly101 is a new procedural activity dataset featuring 4321 videos of
people assembling and disassembling 101 "take-apart" toy vehicles. Participants
work without fixed instructions, and the sequences feature rich and natural
variations in action ordering, mistakes, and corrections. Assembly101 is the
first multi-view action dataset, with simultaneous static (8) and egocentric
(4) recordings. Sequences are annotated with more than 100K coarse and 1M
fine-grained action segments, and 18M 3D hand poses. We benchmark on three
action understanding tasks: recognition, anticipation and temporal
segmentation. Additionally, we propose a novel task of detecting mistakes. The
unique recording format and rich set of annotations allow us to investigate
generalization to new toys, cross-view transfer, long-tailed distributions, and
pose vs. appearance. We envision that Assembly101 will serve as a new challenge
to investigate various activity understanding problems.
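As a rough, assumption-laden sketch of how the recordings and annotations described above could be organized in code (the field names and layout below are illustrative, not Assembly101's actual annotation schema):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record types illustrating the structure described in the abstract:
# 8 static + 4 egocentric views per sequence, coarse and fine-grained action
# segments, and a flag supporting the proposed mistake-detection task.
# These names are assumptions, not the dataset's real annotation format.

@dataclass
class ActionSegment:
    start_frame: int
    end_frame: int
    label: str                # e.g. "attach wheel" (fine) or "assemble chassis" (coarse)
    granularity: str          # "coarse" or "fine"
    is_mistake: bool = False  # used by the hypothetical mistake-detection benchmark

@dataclass
class Sequence:
    toy_id: str                                                 # one of the 101 take-apart toys
    static_views: List[str] = field(default_factory=list)       # paths to the 8 fixed-camera videos
    egocentric_views: List[str] = field(default_factory=list)   # paths to the 4 headset videos
    coarse_segments: List[ActionSegment] = field(default_factory=list)
    fine_segments: List[ActionSegment] = field(default_factory=list)

def mistakes(seq: Sequence) -> List[ActionSegment]:
    """Return the fine-grained segments flagged as mistakes in one sequence."""
    return [s for s in seq.fine_segments if s.is_mistake]
```

Pairing all 12 views with both annotation granularities in one record is what would enable the cross-view transfer and pose-vs-appearance studies mentioned in the abstract.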
Related papers
- CaptainCook4D: A Dataset for Understanding Errors in Procedural Activities [12.38265411170993]
We collect a new egocentric 4D dataset, CaptainCook4D, comprising 384 recordings (94.5 hours) of people performing recipes in real kitchen environments.
This dataset consists of two distinct types of activity: one in which participants adhere to the provided recipe instructions and another in which they deviate and induce errors.
arXiv Detail & Related papers (2023-12-22T09:29:45Z)
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
Later, a decoder aggregates together in an online fashion features from all the clips for the final class prediction.
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
- Every Mistake Counts in Assembly [26.903961683742494]
We propose a system that can detect ordering mistakes by utilizing a learned knowledge base.
Our framework constructs a knowledge base with spatial and temporal beliefs based on observed mistakes.
We demonstrate experimentally that our inferred spatial and temporal beliefs are capable of identifying incorrect orderings in real-world action sequences (a toy sketch of such an ordering check appears after this list).
arXiv Detail & Related papers (2023-07-31T07:20:31Z)
- HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding [5.233797258148846]
HA-ViD is the first human assembly video dataset that features representative industrial assembly scenarios.
We provide 3222 multi-view, multi-modality videos (each video contains one assembly task), 1.5M frames, 96K temporal labels and 2M spatial labels.
We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection and multi-object tracking.
arXiv Detail & Related papers (2023-07-09T08:44:46Z)
- Aligning Step-by-Step Instructional Diagrams to Video Demonstrations [51.67930509196712]
We consider a novel setting where alignment is between (i) instruction steps that are depicted as assembly diagrams and (ii) video segments from in-the-wild videos.
We introduce a novel supervised contrastive learning method that learns to align videos with the subtle details in the assembly diagrams.
Experiments on IAW, a dataset for IKEA assembly in the wild, demonstrate the superior performance of our approach over alternatives.
arXiv Detail & Related papers (2023-03-24T04:45:45Z)
- Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization [74.74339878286935]
Co-occurrence features (surrounding context) often dominate the actual action content in videos.
We develop a novel auxiliary task by decoupling these two types of features within a video snippet.
We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features.
arXiv Detail & Related papers (2022-06-23T06:30:08Z)
- Learning To Recognize Procedural Activities with Distant Supervision [96.58436002052466]
We consider the problem of classifying fine-grained, multi-step activities from long videos spanning up to several minutes.
Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base.
arXiv Detail & Related papers (2022-01-26T15:06:28Z)
- The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose [108.21037046507483]
IKEA ASM is a three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose.
We benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset.
The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
arXiv Detail & Related papers (2020-07-01T11:34:46Z)
- Rescaling Egocentric Vision [48.57283024015145]
This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS.
The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, 90K actions in 700 variable-length videos.
Compared to its previous version, EPIC-KITCHENS-100 has been annotated using a novel pipeline that allows denser (54% more actions per minute) and more complete annotations of fine-grained actions (+128% more action segments).
arXiv Detail & Related papers (2020-06-23T18:28:04Z)
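As a rough illustration of the ordering-mistake idea summarized in "Every Mistake Counts in Assembly" above, the following sketch learns pairwise precedence counts from correct sequences and flags pairs that violate a consistently observed order. The function names and the simple counting heuristic are assumptions made for intuition only, not the paper's actual knowledge base of spatial and temporal beliefs.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

def learn_precedence(correct_sequences: List[List[str]]) -> Dict[Tuple[str, str], int]:
    """Count how often action a is observed before action b across correct sequences."""
    before = defaultdict(int)
    for seq in correct_sequences:
        for i, j in combinations(range(len(seq)), 2):
            before[(seq[i], seq[j])] += 1
    return before

def ordering_mistakes(seq: List[str], before: Dict[Tuple[str, str], int]) -> List[Tuple[str, str]]:
    """Flag pairs (a, b) occurring in the order a -> b although b was always seen before a in training."""
    flagged = []
    for i, j in combinations(range(len(seq)), 2):
        a, b = seq[i], seq[j]
        if before[(b, a)] > 0 and before[(a, b)] == 0:
            flagged.append((a, b))
    return flagged

# Toy usage: "attach wheels" always follows "attach chassis" in training,
# so a test sequence that reverses them gets flagged.
train = [["attach chassis", "attach wheels", "attach cabin"],
         ["attach chassis", "attach cabin", "attach wheels"]]
beliefs = learn_precedence(train)
print(ordering_mistakes(["attach wheels", "attach chassis", "attach cabin"], beliefs))
# -> [('attach wheels', 'attach chassis')]
```

A real system would also need confidence thresholds and spatial (part-connectivity) beliefs, which this toy example omits.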
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.