ATTACH Dataset: Annotated Two-Handed Assembly Actions for Human Action
Understanding
- URL: http://arxiv.org/abs/2304.08210v1
- Date: Mon, 17 Apr 2023 12:31:24 GMT
- Title: ATTACH Dataset: Annotated Two-Handed Assembly Actions for Human Action
Understanding
- Authors: Dustin Aganian, Benedict Stephan, Markus Eisenbach, Corinna Stretz,
and Horst-Michael Gross
- Abstract summary: We present the ATTACH dataset, which contains 51.6 hours of assembly with 95.2k annotated fine-grained actions monitored by three cameras.
In the ATTACH dataset, more than 68% of annotations overlap with other annotations, which is many times more than in related datasets.
We report the performance of state-of-the-art methods for action recognition as well as action detection on video and skeleton-sequence inputs.
- Score: 8.923830513183882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the emergence of collaborative robots (cobots), human-robot
collaboration in industrial manufacturing is coming into focus. For a cobot to
act autonomously and as an assistant, it must understand human actions during
assembly. To effectively train models for this task, a dataset containing
suitable assembly actions in a realistic setting is crucial. For this purpose,
we present the ATTACH dataset, which contains 51.6 hours of assembly with 95.2k
annotated fine-grained actions monitored by three cameras, which represent
potential viewpoints of a cobot. Since in an assembly context workers tend to
perform different actions simultaneously with their two hands, we annotated the
performed actions for each hand separately. Therefore, in the ATTACH dataset,
more than 68% of annotations overlap with other annotations, which is many
times more than in related datasets, typically featuring more simplistic
assembly tasks. For better generalization with respect to the background of the
working area, we did not only record color and depth images, but also used the
Azure Kinect body tracking SDK for estimating 3D skeletons of the worker. To
create a first baseline, we report the performance of state-of-the-art methods
for action recognition as well as action detection on video and
skeleton-sequence inputs. The dataset is available at
https://www.tu-ilmenau.de/neurob/data-sets-code/attach-dataset .
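The overlap statistic quoted in the abstract (more than 68% of annotations overlapping with other annotations) can be illustrated with a minimal sketch. The tuple layout `(start_s, end_s, hand, label)`, the example labels, and the function names are assumptions made for illustration; they are not the dataset's actual annotation format, and the authors' exact overlap definition may differ.

```python
from typing import List, Tuple

# Hypothetical annotation layout: (start_s, end_s, hand, label).
# This is NOT the ATTACH file format, only an illustrative stand-in.
Annotation = Tuple[float, float, str, str]


def intervals_overlap(a: Annotation, b: Annotation) -> bool:
    """True if the two time intervals share more than a single boundary point."""
    return a[0] < b[1] and b[0] < a[1]


def overlap_fraction(annotations: List[Annotation]) -> float:
    """Fraction of annotations that overlap in time with at least one other one."""
    if not annotations:
        return 0.0
    n_overlapping = sum(
        any(intervals_overlap(a, b) for j, b in enumerate(annotations) if j != i)
        for i, a in enumerate(annotations)
    )
    return n_overlapping / len(annotations)


# Made-up example: per-hand annotations of one short recording.
example = [
    (0.0, 2.0, "left", "hold part"),
    (0.5, 1.5, "right", "pick up screw"),
    (1.5, 3.0, "right", "insert screw"),
    (4.0, 5.0, "left", "release part"),
]
print(overlap_fraction(example))
```

The pairwise check is quadratic in the number of annotations, which keeps the sketch easy to verify; applied per recording rather than to the whole dataset, this is likely acceptable, though a sort-and-sweep variant would scale better.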
Related papers
- Understanding Spatio-Temporal Relations in Human-Object Interaction using Pyramid Graph Convolutional Network [2.223052975765005]
We propose a novel Pyramid Graph Convolutional Network (PGCN) to automatically recognize human-object interaction.
The system represents the 2D or 3D spatial relation of human and objects from the detection results in video data as a graph.
We evaluate our model on two challenging datasets in the field of human-object interaction recognition.
arXiv Detail & Related papers (2024-10-10T13:39:17Z)
- Language Supervised Human Action Recognition with Salient Fusion: Construction Worker Action Recognition as a Use Case [8.26451988845854]
We introduce a novel approach to Human Action Recognition (HAR) based on skeleton and visual cues.
We employ learnable prompts for the language model conditioned on the skeleton modality to optimize feature representation.
We introduce a new dataset tailored for real-world robotic applications in construction sites, featuring visual, skeleton, and depth data modalities.
arXiv Detail & Related papers (2024-10-02T19:10:23Z)
- ADL4D: Towards A Contextually Rich Dataset for 4D Activities of Daily Living [4.221961702292134]
ADL4D is a dataset of up to two subjects interacting with different sets of objects performing Activities of Daily Living (ADL).
Our dataset consists of 75 sequences with a total of 1.1M RGB-D frames, hand and object poses, and per-hand fine-grained action annotations.
We develop an automatic system for multi-view multi-hand 3D pose annotation capable of tracking hand poses over time.
arXiv Detail & Related papers (2024-02-27T18:51:52Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability on localizing the active objects by: learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding [5.233797258148846]
HA-ViD is the first human assembly video dataset that features representative industrial assembly scenarios.
We provide 3222 multi-view, multi-modality videos (each video contains one assembly task), 1.5M frames, 96K temporal labels and 2M spatial labels.
We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection and multi-object tracking.
arXiv Detail & Related papers (2023-07-09T08:44:46Z)
- Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion [110.84357383258818]
We propose a novel approach to lift 2D segments to 3D and fuse them by means of a neural field representation.
The core of our approach is a slow-fast clustering objective function, which is scalable and well-suited for scenes with a large number of objects.
Our approach outperforms the state-of-the-art on challenging scenes from the ScanNet, Hypersim, and Replica datasets.
arXiv Detail & Related papers (2023-06-07T17:57:45Z)
- Video-based Pose-Estimation Data as Source for Transfer Learning in Human Activity Recognition [71.91734471596433]
Human Activity Recognition (HAR) using on-body devices identifies specific human actions in unconstrained environments.
Previous works demonstrated that transfer learning is a good strategy for addressing scenarios with scarce data.
This paper proposes using datasets intended for human-pose estimation as a source for transfer learning.
arXiv Detail & Related papers (2022-12-02T18:19:36Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- H2O: Two Hands Manipulating Objects for First Person Interaction Recognition [70.46638409156772]
We present a comprehensive framework for egocentric interaction recognition using markerless 3D annotations of two hands manipulating objects.
Our method produces annotations of the 3D pose of two hands and the 6D pose of the manipulated objects, along with their interaction labels for each frame.
Our dataset, called H2O (2 Hands and Objects), provides synchronized multi-view RGB-D images, interaction labels, object classes, ground-truth 3D poses for left & right hands, 6D object poses, ground-truth camera poses, object meshes and scene point clouds.
arXiv Detail & Related papers (2021-04-22T17:10:42Z)
- "What's This?" -- Learning to Segment Unknown Objects from Manipulation Sequences [27.915309216800125]
We present a novel framework for self-supervised grasped object segmentation with a robotic manipulator.
We propose a single, end-to-end trainable architecture which jointly incorporates motion cues and semantic knowledge.
Our method neither depends on any visual registration of a kinematic robot or 3D object models, nor on precise hand-eye calibration or any additional sensor data.
arXiv Detail & Related papers (2020-11-06T10:55:28Z)
- The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose [108.21037046507483]
IKEA ASM is a three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose.
We benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset.
The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
arXiv Detail & Related papers (2020-07-01T11:34:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.