Learning and Verification of Task Structure in Instructional Videos
- URL: http://arxiv.org/abs/2303.13519v1
- Date: Thu, 23 Mar 2023 17:59:54 GMT
- Title: Learning and Verification of Task Structure in Instructional Videos
- Authors: Medhini Narasimhan, Licheng Yu, Sean Bell, Ning Zhang, Trevor Darrell
- Abstract summary: We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos.
Compared to prior work, which learns step representations locally, our approach learns them globally.
We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order.
- Score: 85.511888642497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given the enormous number of instructional videos available online, learning
a diverse array of multi-step task models from videos is an appealing goal. We
introduce a new pre-trained video model, VideoTaskformer, focused on
representing the semantics and structure of instructional videos. We pre-train
VideoTaskformer using a simple and effective objective: predicting weakly
supervised textual labels for steps that are randomly masked out from an
instructional video (masked step modeling). Compared to prior work which learns
step representations locally, our approach involves learning them globally,
leveraging video of the entire surrounding task as context. From these learned
representations, we can verify if an unseen video correctly executes a given
task, as well as forecast which steps are likely to be taken after a given
step. We introduce two new benchmarks for detecting mistakes in instructional
videos, to verify if there is an anomalous step and if steps are executed in
the right order. We also introduce a long-term forecasting benchmark, where the
goal is to predict long-range future steps from a given step. Our method
outperforms previous baselines on these tasks, and we believe the tasks will be
a valuable way for the community to measure the quality of step
representations. Additionally, we evaluate VideoTaskformer on 3 existing
benchmarks -- procedural activity recognition, step classification, and step
forecasting -- and demonstrate on each that our method outperforms existing
baselines and achieves new state-of-the-art performance.
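The masked step modeling objective described above is simple enough to sketch. Below is a minimal, illustrative PyTorch sketch, assuming each video is already segmented into steps with precomputed per-step features and weakly supervised textual labels mapped to integer ids; the class, dimensions, and variable names (MaskedStepModel, FEAT_DIM, NUM_STEP_LABELS) are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn as nn

FEAT_DIM = 512          # per-step video feature size (assumed)
NUM_STEP_LABELS = 1000  # vocabulary of weak textual step labels (assumed)

class MaskedStepModel(nn.Module):
    """Transformer over all steps of a task; predicts the labels of masked steps."""
    def __init__(self, feat_dim=FEAT_DIM, num_labels=NUM_STEP_LABELS):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.classifier = nn.Linear(feat_dim, num_labels)

    def forward(self, step_feats, mask):
        # step_feats: (batch, num_steps, feat_dim); mask: (batch, num_steps) bool.
        # Masked steps are replaced by a learned token; every position then
        # attends to the whole task, so predictions use global context.
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(step_feats), step_feats)
        return self.classifier(self.encoder(x))

# Toy pre-training step: classify weak step labels only at masked positions.
model = MaskedStepModel()
feats = torch.randn(2, 8, FEAT_DIM)                 # 2 videos, 8 steps each
labels = torch.randint(0, NUM_STEP_LABELS, (2, 8))  # weak step-label ids
mask = torch.rand(2, 8) < 0.25                      # randomly mask steps
mask[:, 0] = True                                   # ensure at least one masked step
logits = model(feats, mask)
loss = nn.functional.cross_entropy(logits[mask], labels[mask])
loss.backward()

The downstream verification and forecasting tasks described above would then be built on the encoder's step representations rather than on this toy classifier.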
Related papers
- Video-Mined Task Graphs for Keystep Recognition in Instructional Videos [71.16703750980143]
Procedural activity understanding requires perceiving human actions in terms of a broader task.
We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps.
We show the impact: more reliable zero-shot keystep localization and improved video representation learning (a toy transition-probability sketch of this idea appears at the end of this list).
arXiv Detail & Related papers (2023-07-17T18:19:36Z)
- Learning to Ground Instructional Articles in Videos through Narrations [50.3463147014498]
We present an approach for localizing steps of procedural activities in narrated how-to videos.
We source the step descriptions from a language knowledge base (wikiHow) containing instructional articles.
Our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities.
arXiv Detail & Related papers (2023-06-06T15:45:53Z)
- Non-Sequential Graph Script Induction via Multimedia Grounding [129.83134296316493]
We train a script knowledge model capable of both generating explicit graph scripts for learnt tasks and predicting future steps given a partial step sequence.
Human evaluation shows our model outperforms the WikiHow linear baseline by 48.76% absolute in capturing sequential and non-sequential step relationships.
arXiv Detail & Related papers (2023-05-27T18:13:17Z)
- StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos [47.03252542488226]
We introduce StepFormer, a self-supervised model that discovers and localizes instruction steps in a video.
We train our system on a large dataset of instructional videos, using their automatically-generated subtitles as the only source of supervision.
Our model outperforms all previous unsupervised and weakly-supervised approaches on step detection and localization.
arXiv Detail & Related papers (2023-04-26T03:37:28Z)
- Procedure-Aware Pretraining for Instructional Video Understanding [58.214549181779006]
A key challenge in procedure understanding is extracting procedural knowledge from unlabeled videos.
Our main insight is that instructional videos depict sequences of steps that repeat between instances of the same or different tasks.
A graph built over these shared steps can then be used to generate pseudo labels for training a video representation that encodes the procedural knowledge in a more accessible form.
arXiv Detail & Related papers (2023-03-31T17:41:31Z)
- Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations [22.723309913388196]
We learn video representation that encodes both action steps and their temporal ordering, based on a large-scale dataset of web instructional videos and their narrations.
Our method jointly learns a video representation to encode individual step concepts, and a deep probabilistic model to capture both temporal dependencies and immense individual variations in the step ordering.
arXiv Detail & Related papers (2023-03-31T07:02:26Z)
- Learning To Recognize Procedural Activities with Distant Supervision [96.58436002052466]
We consider the problem of classifying fine-grained, multi-step activities from long videos spanning up to several minutes.
Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base.
arXiv Detail & Related papers (2022-01-26T15:06:28Z)
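Several entries above model task structure as transition probabilities between keysteps, most directly the Video-Mined Task Graphs paper. A minimal, library-free Python sketch of that idea follows, assuming keystep sequences have already been mined from videos; the sequences and step names are invented for illustration, and the actual papers learn such structure from video features rather than from raw counts.

from collections import Counter, defaultdict

# Toy keystep sequences mined from how-to videos (illustrative only).
sequences = [
    ["crack eggs", "whisk", "heat pan", "pour batter", "flip"],
    ["crack eggs", "whisk", "pour batter", "flip"],
    ["heat pan", "crack eggs", "whisk", "pour batter", "flip"],
]

# Count observed transitions between consecutive keysteps.
transitions = defaultdict(Counter)
for seq in sequences:
    for prev_step, next_step in zip(seq, seq[1:]):
        transitions[prev_step][next_step] += 1

# Normalize counts into conditional probabilities P(next step | previous step).
task_graph = {
    prev: {nxt: c / sum(counter.values()) for nxt, c in counter.items()}
    for prev, counter in transitions.items()
}

def likely_next_steps(step, k=3):
    """Rank the most probable next keysteps after `step`."""
    return sorted(task_graph.get(step, {}).items(), key=lambda kv: kv[1], reverse=True)[:k]

print(likely_next_steps("whisk"))  # -> pour batter (2/3), heat pan (1/3) on the toy data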
This list is automatically generated from the titles and abstracts of the papers on this site.