Assembly101: A Large-Scale Multi-View Video Dataset for Understanding
Procedural Activities
- URL: http://arxiv.org/abs/2203.14712v1
- Date: Mon, 28 Mar 2022 12:59:50 GMT
- Title: Assembly101: A Large-Scale Multi-View Video Dataset for Understanding
Procedural Activities
- Authors: Fadime Sener and Dibyadip Chatterjee and Daniel Shelepov and Kun He
and Dipika Singhania and Robert Wang and Angela Yao
- Abstract summary: Assembly101 is a new procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 "take-apart" toy vehicles.
Participants work without fixed instructions, and the sequences feature rich and natural variations in action ordering, mistakes, and corrections.
Sequences are annotated with more than 100K coarse and 1M fine-grained action segments, and 18M 3D hand poses.
- Score: 29.05606394634704
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Assembly101 is a new procedural activity dataset featuring 4321 videos of
people assembling and disassembling 101 "take-apart" toy vehicles. Participants
work without fixed instructions, and the sequences feature rich and natural
variations in action ordering, mistakes, and corrections. Assembly101 is the
first multi-view action dataset, with simultaneous static (8) and egocentric
(4) recordings. Sequences are annotated with more than 100K coarse and 1M
fine-grained action segments, and 18M 3D hand poses. We benchmark on three
action understanding tasks: recognition, anticipation and temporal
segmentation. Additionally, we propose a novel task of detecting mistakes. The
unique recording format and rich set of annotations allow us to investigate
generalization to new toys, cross-view transfer, long-tailed distributions, and
pose vs. appearance. We envision that Assembly101 will serve as a new challenge
to investigate various activity understanding problems.
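As a rough, assumption-laden sketch of how the recordings and annotations described above could be organized in code (the field names and layout below are illustrative, not Assembly101's actual annotation schema):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record types illustrating the structure described in the abstract:
# 8 static + 4 egocentric views per sequence, coarse and fine-grained action
# segments, and a flag supporting the proposed mistake-detection task.
# These names are assumptions, not the dataset's real annotation format.

@dataclass
class ActionSegment:
    start_frame: int
    end_frame: int
    label: str                # e.g. "attach wheel" (fine) or "assemble chassis" (coarse)
    granularity: str          # "coarse" or "fine"
    is_mistake: bool = False  # used by the hypothetical mistake-detection benchmark

@dataclass
class Sequence:
    toy_id: str                                                 # one of the 101 take-apart toys
    static_views: List[str] = field(default_factory=list)       # paths to the 8 fixed-camera videos
    egocentric_views: List[str] = field(default_factory=list)   # paths to the 4 headset videos
    coarse_segments: List[ActionSegment] = field(default_factory=list)
    fine_segments: List[ActionSegment] = field(default_factory=list)

def mistakes(seq: Sequence) -> List[ActionSegment]:
    """Return the fine-grained segments flagged as mistakes in one sequence."""
    return [s for s in seq.fine_segments if s.is_mistake]
```

Pairing all 12 views with both annotation granularities in one record is what would enable the cross-view transfer and pose-vs-appearance studies mentioned in the abstract.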
Related papers
- CaptainCook4D: A Dataset for Understanding Errors in Procedural Activities [12.38265411170993]
We collect a new egocentric 4D dataset, CaptainCook4D, comprising 384 recordings (94.5 hours) of people performing recipes in real kitchen environments.
This dataset consists of two distinct types of activity: one in which participants adhere to the provided recipe instructions and another in which they deviate and induce errors.
arXiv Detail & Related papers (2023-12-22T09:29:45Z)
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
Later, a decoder aggregates together in an online fashion features from all the clips for the final class prediction.
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
- Every Mistake Counts in Assembly [26.903961683742494]
We propose a system that can detect ordering mistakes by utilizing a learned knowledge base.
Our framework constructs a knowledge base with spatial and temporal beliefs based on observed mistakes.
We demonstrate experimentally that our inferred spatial and temporal beliefs are capable of identifying incorrect orderings in real-world action sequences (a toy sketch of such an ordering check appears after this list).
arXiv Detail & Related papers (2023-07-31T07:20:31Z)
- HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding [5.233797258148846]
HA-ViD is the first human assembly video dataset that features representative industrial assembly scenarios.
We provide 3222 multi-view, multi-modality videos (each video contains one assembly task), 1.5M frames, 96K temporal labels and 2M spatial labels.
We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection and multi-object tracking.
arXiv Detail & Related papers (2023-07-09T08:44:46Z)
- Aligning Step-by-Step Instructional Diagrams to Video Demonstrations [51.67930509196712]
We consider a novel setting where alignment is between (i) instruction steps that are depicted as assembly diagrams and (ii) video segments from in-the-wild videos.
We introduce a novel supervised contrastive learning method that learns to align videos with the subtle details in the assembly diagrams.
Experiments on IAW, a dataset for IKEA assembly in the wild, demonstrate the superior performance of our approach over alternatives.
arXiv Detail & Related papers (2023-03-24T04:45:45Z)
- Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization [74.74339878286935]
Co-occurrence features (surrounding context) often dominate the actual action content in videos.
We develop a novel auxiliary task by decoupling these two types of features within a video snippet.
We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features.
arXiv Detail & Related papers (2022-06-23T06:30:08Z)
- Learning To Recognize Procedural Activities with Distant Supervision [96.58436002052466]
We consider the problem of classifying fine-grained, multi-step activities from long videos spanning up to several minutes.
Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base.
arXiv Detail & Related papers (2022-01-26T15:06:28Z)
- The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose [108.21037046507483]
IKEA ASM is a three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose.
We benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset.
The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
arXiv Detail & Related papers (2020-07-01T11:34:46Z)
- Rescaling Egocentric Vision [48.57283024015145]
This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS.
The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, 90K actions in 700 variable-length videos.
Compared to its previous version, EPIC-KITCHENS-100 has been annotated using a novel pipeline that allows denser (54% more actions per minute) and more complete annotations of fine-grained actions (+128% more action segments).
arXiv Detail & Related papers (2020-06-23T18:28:04Z)
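As a rough illustration of the ordering-mistake idea summarized in "Every Mistake Counts in Assembly" above, the following sketch learns pairwise precedence counts from correct sequences and flags pairs that violate a consistently observed order. The function names and the simple counting heuristic are assumptions made for intuition only, not the paper's actual knowledge base of spatial and temporal beliefs.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

def learn_precedence(correct_sequences: List[List[str]]) -> Dict[Tuple[str, str], int]:
    """Count how often action a is observed before action b across correct sequences."""
    before = defaultdict(int)
    for seq in correct_sequences:
        for i, j in combinations(range(len(seq)), 2):
            before[(seq[i], seq[j])] += 1
    return before

def ordering_mistakes(seq: List[str], before: Dict[Tuple[str, str], int]) -> List[Tuple[str, str]]:
    """Flag pairs (a, b) occurring in the order a -> b although b was always seen before a in training."""
    flagged = []
    for i, j in combinations(range(len(seq)), 2):
        a, b = seq[i], seq[j]
        if before[(b, a)] > 0 and before[(a, b)] == 0:
            flagged.append((a, b))
    return flagged

# Toy usage: "attach wheels" always follows "attach chassis" in training,
# so a test sequence that reverses them gets flagged.
train = [["attach chassis", "attach wheels", "attach cabin"],
         ["attach chassis", "attach cabin", "attach wheels"]]
beliefs = learn_precedence(train)
print(ordering_mistakes(["attach wheels", "attach chassis", "attach cabin"], beliefs))
# -> [('attach wheels', 'attach chassis')]
```

A real system would also need confidence thresholds and spatial (part-connectivity) beliefs, which this toy example omits.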
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.