The IKEA ASM Dataset: Understanding People Assembling Furniture through
Actions, Objects and Pose
- URL: http://arxiv.org/abs/2007.00394v2
- Date: Wed, 17 May 2023 07:56:52 GMT
- Title: The IKEA ASM Dataset: Understanding People Assembling Furniture through
Actions, Objects and Pose
- Authors: Yizhak Ben-Shabat, Xin Yu, Fatemeh Sadat Saleh, Dylan Campbell,
Cristian Rodriguez-Opazo, Hongdong Li, Stephen Gould
- Abstract summary: IKEA ASM is a three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose.
We benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset.
The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
- Score: 108.21037046507483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The availability of a large labeled dataset is a key requirement for applying
deep learning methods to solve various computer vision tasks. In the context of
understanding human activities, existing public datasets, while large in size,
are often limited to a single RGB camera and provide only per-frame or per-clip
action annotations. To enable richer analysis and understanding of human
activities, we introduce IKEA ASM -- a three million frame, multi-view,
furniture assembly video dataset that includes depth, atomic actions, object
segmentation, and human pose. Additionally, we benchmark prominent methods for
video action recognition, object segmentation and human pose estimation tasks
on this challenging dataset. The dataset enables the development of holistic
methods, which integrate multi-modal and multi-view data to better perform on
these tasks.
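To make the multi-view, multi-modal structure described above concrete, the sketch below indexes per-view RGB and depth frames together with per-frame atomic-action labels. It is a minimal illustration only: the directory layout (`<scan>/<view>/rgb`, `<scan>/<view>/depth`), the `labels.json` schema, and all field names are assumptions for illustration, not the dataset's actual API or on-disk format.

```python
from dataclasses import dataclass
from pathlib import Path
import json

# NOTE: the directory layout and label schema below are hypothetical,
# chosen only to illustrate indexing a multi-view RGB-D dataset with
# per-frame atomic-action labels. They are not the official IKEA ASM format.

@dataclass
class FrameRecord:
    scan_id: str      # one furniture-assembly recording
    view: str         # camera identifier, e.g. "dev1"
    frame_idx: int
    rgb_path: Path
    depth_path: Path
    action: str       # per-frame atomic-action label


def index_scan(scan_dir: Path, views=("dev1", "dev2", "dev3")):
    """Pair RGB/depth frames from each camera view with per-frame labels."""
    # Hypothetical schema: {"frames": [{"idx": 0, "action": "pick up leg"}, ...]}
    labels = json.loads((scan_dir / "labels.json").read_text())
    actions = {f["idx"]: f["action"] for f in labels["frames"]}

    records = []
    for view in views:
        rgb_dir = scan_dir / view / "rgb"
        depth_dir = scan_dir / view / "depth"
        for rgb_path in sorted(rgb_dir.glob("*.jpg")):
            idx = int(rgb_path.stem)  # assumes zero-padded numeric frame names
            records.append(FrameRecord(
                scan_id=scan_dir.name,
                view=view,
                frame_idx=idx,
                rgb_path=rgb_path,
                depth_path=depth_dir / f"{rgb_path.stem}.png",
                action=actions.get(idx, "no_action"),
            ))
    return records


if __name__ == "__main__":
    # Hypothetical path; adjust to wherever the data is stored locally.
    for rec in index_scan(Path("data/ikea_asm/scan_0001"))[:5]:
        print(rec.view, rec.frame_idx, rec.action)
```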
Related papers
- ADL4D: Towards A Contextually Rich Dataset for 4D Activities of Daily
Living [4.221961702292134]
ADL4D is a dataset of up to two subjects interacting with different sets of objects while performing Activities of Daily Living (ADL).
Our dataset consists of 75 sequences with a total of 1.1M RGB-D frames, hand and object poses, and per-hand fine-grained action annotations.
We develop an automatic system for multi-view multi-hand 3D pose annotation capable of tracking hand poses over time.
arXiv Detail & Related papers (2024-02-27T18:51:52Z) - SM$^3$: Self-Supervised Multi-task Modeling with Multi-view 2D Images
for Articulated Objects [24.737865259695006]
We propose a self-supervised interaction perception method, referred to as SM$^3$, to model articulated objects.
By constructing 3D geometries and textures from the captured 2D images, SM$^3$ achieves integrated optimization of movable part and joint parameters.
Evaluations demonstrate that SM$^3$ surpasses existing benchmarks across various categories and objects, while its adaptability in real-world scenarios has been thoroughly validated.
arXiv Detail & Related papers (2024-01-17T11:15:09Z) - Weakly Supervised Multi-Task Representation Learning for Human Activity
Analysis Using Wearables [2.398608007786179]
We propose a weakly supervised multi-output siamese network that learns to map the data into multiple representation spaces.
The representations of the data samples are positioned in the space such that data with the same semantic meaning in that aspect are located close to each other.
arXiv Detail & Related papers (2023-08-06T08:20:07Z) - HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly
Knowledge Understanding [5.233797258148846]
HA-ViD is the first human assembly video dataset that features representative industrial assembly scenarios.
We provide 3222 multi-view, multi-modality videos (each video contains one assembly task), 1.5M frames, 96K temporal labels and 2M spatial labels.
We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection and multi-object tracking.
arXiv Detail & Related papers (2023-07-09T08:44:46Z) - Towards Multimodal Multitask Scene Understanding Models for Indoor
Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle this challenge.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par or even better than single-task models.
arXiv Detail & Related papers (2022-09-27T04:49:19Z) - ObjectFolder: A Dataset of Objects with Implicit Visual, Auditory, and
Tactile Representations [52.226947570070784]
We present ObjectFolder, a dataset of 100 objects that addresses both challenges with two key innovations.
First, ObjectFolder encodes the visual, auditory, and tactile sensory data for all objects, enabling a number of multisensory object recognition tasks.
Second, ObjectFolder employs a uniform, object-centric, and implicit representation for each object's visual textures, acoustic simulations, and tactile readings, making the dataset flexible to use and easy to share.
arXiv Detail & Related papers (2021-09-16T14:00:59Z) - JRDB-Act: A Large-scale Multi-modal Dataset for Spatio-temporal Action,
Social Group and Activity Detection [54.696819174421584]
We introduce JRDB-Act, a multi-modal dataset that reflects a real distribution of human daily life actions in a university campus environment.
JRDB-Act has been densely annotated with atomic actions and comprises over 2.8M action labels.
JRDB-Act comes with social group identification annotations conducive to the task of grouping individuals based on their interactions in the scene.
arXiv Detail & Related papers (2021-06-16T14:43:46Z) - REGRAD: A Large-Scale Relational Grasp Dataset for Safe and
Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named REGRAD to sustain the modeling of relationships among objects and grasps.
Our dataset is collected in both forms of 2D images and 3D point clouds.
Users are free to import their own object models to generate as much data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z) - LEMMA: A Multi-view Dataset for Learning Multi-agent Multi-task
Activities [119.88381048477854]
We introduce the LEMMA dataset to provide, in a single home, meticulously designed settings for studying multi-agent, multi-task activities.
We densely annotate atomic actions with human-object interactions to provide ground truths for the compositionality, scheduling, and assignment of daily activities.
We hope this effort will drive the machine vision community to examine goal-directed human activities and further study task scheduling and assignment in the real world.
arXiv Detail & Related papers (2020-07-31T00:13:54Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.