CaptainCook4D: A Dataset for Understanding Errors in Procedural Activities
- URL: http://arxiv.org/abs/2312.14556v3
- Date: Fri, 01 Nov 2024 16:12:52 GMT
- Title: CaptainCook4D: A Dataset for Understanding Errors in Procedural Activities
- Authors: Rohith Peddi, Shivvrat Arya, Bharath Challa, Likhitha Pallapothula, Akshay Vyas, Bhavya Gouripeddi, Jikai Wang, Qifan Zhang, Vasundhara Komaragiri, Eric Ragan, Nicholas Ruozzi, Yu Xiang, Vibhav Gogate,
- Abstract summary: We collect a new egocentric 4D dataset, CaptainCook4D, comprising 384 recordings (94.5 hours) of people performing recipes in real kitchen environments.
This dataset consists of two distinct types of activity: one in which participants adhere to the provided recipe instructions and another in which they deviate and induce errors.
- Score: 12.38265411170993
- License:
- Abstract: Following step-by-step procedures is an essential component of various activities carried out by individuals in their daily lives. These procedures serve as a guiding framework that helps to achieve goals efficiently, whether it is assembling furniture or preparing a recipe. However, the complexity and duration of procedural activities inherently increase the likelihood of making errors. Understanding such procedural activities from a sequence of frames is a challenging task that demands an accurate interpretation of visual information and the ability to reason about the structure of the activity. To this end, we collect a new egocentric 4D dataset, CaptainCook4D, comprising 384 recordings (94.5 hours) of people performing recipes in real kitchen environments. This dataset consists of two distinct types of activity: one in which participants adhere to the provided recipe instructions and another in which they deviate and induce errors. We provide 5.3K step annotations and 10K fine-grained action annotations and benchmark the dataset for the following tasks: supervised error recognition, multistep localization, and procedure learning
Related papers
- NaturalVLM: Leveraging Fine-grained Natural Language for
Affordance-Guided Visual Manipulation [21.02437461550044]
Many real-world tasks demand intricate multi-step reasoning.
We introduce a benchmark, NrVLM, comprising 15 distinct manipulation tasks.
We propose a novel learning framework that completes the manipulation task step-by-step according to the fine-grained instructions.
arXiv Detail & Related papers (2024-03-13T09:12:16Z) - Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos [16.333295670635557]
We explore the capability of an agent to construct a logical sequence of action steps, thereby assembling a strategic procedural plan.
This plan is crucial for navigating from an initial visual observation to a target visual outcome, as depicted in real-life instructional videos.
We coin our approach KEPP, a novel Knowledge-Enhanced Procedure Planning system, which harnesses a probabilistic procedural knowledge graph extracted from training data.
arXiv Detail & Related papers (2024-03-05T08:55:51Z) - Data-CUBE: Data Curriculum for Instruction-based Sentence Representation
Learning [85.66907881270785]
We propose a data curriculum method, namely Data-CUBE, that arranges the orders of all the multi-task data for training.
In the task level, we aim to find the optimal task order to minimize the total cross-task interference risk.
In the instance level, we measure the difficulty of all instances per task, then divide them into the easy-to-difficult mini-batches for training.
arXiv Detail & Related papers (2024-01-07T18:12:20Z) - IndustReal: A Dataset for Procedure Step Recognition Handling Execution
Errors in Egocentric Videos in an Industrial-Like Setting [7.561148568365396]
We present the novel task of procedure step recognition (PSR)
PSR focuses on recognizing the correct completion and order of procedural steps.
We also present the multi-modal IndustReal dataset.
arXiv Detail & Related papers (2023-10-26T11:44:29Z) - Non-Sequential Graph Script Induction via Multimedia Grounding [129.83134296316493]
We train a script knowledge model capable of both generating explicit graph scripts for learnt tasks and predicting future steps given a partial step sequence.
Human evaluation shows our model outperforming the WikiHow linear baseline by 48.76% absolute gains in capturing sequential and non-sequential step relationships.
arXiv Detail & Related papers (2023-05-27T18:13:17Z) - Procedure-Aware Pretraining for Instructional Video Understanding [58.214549181779006]
Key challenge in procedure understanding is to be able to extract from unlabeled videos the procedural knowledge.
Our main insight is that instructional videos depict sequences of steps that repeat between instances of the same or different tasks.
This graph can then be used to generate pseudo labels to train a video representation that encodes the procedural knowledge in a more accessible form.
arXiv Detail & Related papers (2023-03-31T17:41:31Z) - Unsupervised Task Graph Generation from Instructional Video Transcripts [53.54435048879365]
We consider a setting where text transcripts of instructional videos performing a real-world activity are provided.
The goal is to identify the key steps relevant to the task as well as the dependency relationship between these key steps.
We propose a novel task graph generation approach that combines the reasoning capabilities of instruction-tuned language models along with clustering and ranking components.
arXiv Detail & Related papers (2023-02-17T22:50:08Z) - Learning To Recognize Procedural Activities with Distant Supervision [96.58436002052466]
We consider the problem of classifying fine-grained, multi-step activities from long videos spanning up to several minutes.
Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base.
arXiv Detail & Related papers (2022-01-26T15:06:28Z) - LEMMA: A Multi-view Dataset for Learning Multi-agent Multi-task
Activities [119.88381048477854]
We introduce the LEMMA dataset to provide a single home to address missing dimensions with meticulously designed settings.
We densely annotate the atomic-actions with human-object interactions to provide ground-truths of the compositionality, scheduling, and assignment of daily activities.
We hope this effort would drive the machine vision community to examine goal-directed human activities and further study the task scheduling and assignment in the real world.
arXiv Detail & Related papers (2020-07-31T00:13:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.