HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly
Knowledge Understanding
- URL: http://arxiv.org/abs/2307.05721v1
- Date: Sun, 9 Jul 2023 08:44:46 GMT
- Title: HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly
Knowledge Understanding
- Authors: Hao Zheng, Regina Lee, Yuqian Lu
- Abstract summary: HA-ViD is the first human assembly video dataset that features representative industrial assembly scenarios.
We provide 3222 multi-view, multi-modality videos (each video contains one assembly task), 1.5M frames, 96K temporal labels and 2M spatial labels.
We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection and multi-object tracking.
- Score: 5.233797258148846
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding comprehensive assembly knowledge from videos is critical for
futuristic ultra-intelligent industry. To enable technological breakthrough, we
present HA-ViD - the first human assembly video dataset that features
representative industrial assembly scenarios, natural procedural knowledge
acquisition process, and consistent human-robot shared annotations.
Specifically, HA-ViD captures diverse collaboration patterns of real-world
assembly, natural human behaviors and learning progression during assembly, and
granulate action annotations to subject, action verb, manipulated object,
target object, and tool. We provide 3222 multi-view, multi-modality videos
(each video contains one assembly task), 1.5M frames, 96K temporal labels and
2M spatial labels. We benchmark four foundational video understanding tasks:
action recognition, action segmentation, object detection and multi-object
tracking. Importantly, we analyze their performance for comprehending knowledge
in assembly progress, process efficiency, task collaboration, skill parameters
and human intention. Details of HA-ViD is available at:
https://iai-hrc.github.io/ha-vid.
Related papers
- HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs.
HumanVBench comprises 17 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects.
arXiv Detail & Related papers (2024-12-23T13:45:56Z) - Holistic Understanding of 3D Scenes as Universal Scene Description [56.69740649781989]
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI.
We introduce an expertly curated dataset in the Universal Scene Description (USD) format featuring high-quality manual annotations.
With its broad and high-quality annotations, the data provides the basis for holistic 3D scene understanding models.
arXiv Detail & Related papers (2024-12-02T11:33:55Z) - EAGLE: Egocentric AGgregated Language-video Engine [34.60423566630983]
We introduce the Eagle (Egocentric AGgregated Language-video Engine) model and the Eagle-400K dataset to provide a unified framework that integrates various egocentric video understanding tasks.
Egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective.
arXiv Detail & Related papers (2024-09-26T04:17:27Z) - VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
We introduce a new task, Reasoning Video Object (ReasonVOS)
This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities.
We introduce VISA (Video-based large language Instructed Assistant) to tackle ReasonVOS.
arXiv Detail & Related papers (2024-07-16T02:29:29Z) - Look, Remember and Reason: Grounded reasoning in videos with language
models [5.3445140425713245]
Multi-temporal language models (LM) have recently shown promising performance in high-level reasoning tasks on videos.
We propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, tracking, to endow the model with the required low-level visual capabilities.
We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets.
arXiv Detail & Related papers (2023-06-30T16:31:14Z) - A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In
Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate a lack of story understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science on persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z) - ATTACH Dataset: Annotated Two-Handed Assembly Actions for Human Action
Understanding [8.923830513183882]
We present the ATTACH dataset, which contains 51.6 hours of assembly with 95.2k annotated fine-grained actions monitored by three cameras.
In the ATTACH dataset, more than 68% of annotations overlap with other annotations, which is many times more than in related datasets.
We report the performance of state-of-the-art methods for action recognition as well as action detection on video and skeleton-sequence inputs.
arXiv Detail & Related papers (2023-04-17T12:31:24Z) - Video Exploration via Video-Specific Autoencoders [60.256055890647595]
We present video-specific autoencoders that enables human-controllable video exploration.
We observe that a simple autoencoder trained on multiple frames of a specific video enables one to perform a large variety of video processing and editing tasks.
arXiv Detail & Related papers (2021-03-31T17:56:13Z) - The IKEA ASM Dataset: Understanding People Assembling Furniture through
Actions, Objects and Pose [108.21037046507483]
IKEA ASM is a three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose.
We benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset.
The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
arXiv Detail & Related papers (2020-07-01T11:34:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.