Procedure-Aware Pretraining for Instructional Video Understanding
- URL: http://arxiv.org/abs/2303.18230v1
- Date: Fri, 31 Mar 2023 17:41:31 GMT
- Title: Procedure-Aware Pretraining for Instructional Video Understanding
- Authors: Honglu Zhou, Roberto Martín-Martín, Mubbasir Kapadia, Silvio
Savarese, Juan Carlos Niebles
- Abstract summary: A key challenge in procedure understanding is extracting procedural knowledge from unlabeled videos.
Our main insight is that instructional videos depict sequences of steps that repeat between instances of the same or different tasks.
This structure can be captured by a Procedural Knowledge Graph (PKG), which is then used to generate pseudo labels to train a video representation that encodes the procedural knowledge in a more accessible form.
- Score: 58.214549181779006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Our goal is to learn a video representation that is useful for downstream
procedure understanding tasks in instructional videos. Due to the small amount
of available annotations, a key challenge in procedure understanding is to be
able to extract from unlabeled videos the procedural knowledge such as the
identity of the task (e.g., 'make latte'), its steps (e.g., 'pour milk'), or
the potential next steps given partial progress in its execution. Our main
insight is that instructional videos depict sequences of steps that repeat
between instances of the same or different tasks, and that this structure can
be well represented by a Procedural Knowledge Graph (PKG), where nodes are
discrete steps and edges connect steps that occur sequentially in the
instructional activities. This graph can then be used to generate pseudo labels
to train a video representation that encodes the procedural knowledge in a more
accessible form to generalize to multiple procedure understanding tasks. We
build a PKG by combining information from a text-based procedural knowledge
database and an unlabeled instructional video corpus and then use it to
generate training pseudo labels with four novel pre-training objectives. We
call this PKG-based pre-training procedure and the resulting model Paprika,
Procedure-Aware PRe-training for Instructional Knowledge Acquisition. We
evaluate Paprika on COIN and CrossTask for procedure understanding tasks such
as task recognition, step recognition, and step forecasting. Paprika yields a
video representation that improves over the state of the art: up to 11.23%
gains in accuracy in 12 evaluation settings. Implementation is available at
https://github.com/salesforce/paprika.
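The Paprika implementation is linked above; as a rough, hypothetical illustration of the PKG idea (not the authors' code), the sketch below builds a toy step-transition graph from discretized step sequences and reads off next-step pseudo labels of the kind such a graph can supply. The function names and toy sequences are assumptions.

```python
# Minimal sketch, not the Paprika implementation: nodes are discrete step IDs,
# and a directed edge (a -> b) is added whenever step b directly follows step a
# in some instructional video.
from collections import defaultdict

def build_pkg(step_sequences):
    """Build a toy Procedural Knowledge Graph as an adjacency map."""
    edges = defaultdict(set)
    for seq in step_sequences:
        for a, b in zip(seq, seq[1:]):
            edges[a].add(b)
    return edges

def next_step_pseudo_labels(pkg, step_id):
    """One kind of pseudo label: the plausible next steps after `step_id`."""
    return sorted(pkg.get(step_id, set()))

# Hypothetical step-ID sequences from two tasks that share early steps.
sequences = [
    [0, 1, 2, 3],  # e.g. 'make latte'
    [0, 1, 4],     # e.g. 'make espresso'
]
pkg = build_pkg(sequences)
print(next_step_pseudo_labels(pkg, 1))  # -> [2, 4]
```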
Related papers
- Video-Mined Task Graphs for Keystep Recognition in Instructional Videos [71.16703750980143]
Procedural activity understanding requires perceiving human actions in terms of a broader task.
We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps.
We show the impact: more reliable zero-shot keystep localization and improved video representation learning.
arXiv Detail & Related papers (2023-07-17T18:19:36Z)
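For the task-graph paper above, a hedged sketch of the general idea (not that paper's method): estimate how people tend to transition between keysteps by counting consecutive pairs in mined keystep sequences. The example sequences are hypothetical.

```python
# Illustrative only: turn mined keystep sequences into transition probabilities,
# i.e. a probabilistically weighted task graph.
from collections import Counter, defaultdict

def transition_probabilities(keystep_sequences):
    counts = defaultdict(Counter)
    for seq in keystep_sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

probs = transition_probabilities([["crack egg", "whisk", "fry"],
                                  ["crack egg", "fry"]])
print(probs["crack egg"])  # {'whisk': 0.5, 'fry': 0.5}
```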
- Non-Sequential Graph Script Induction via Multimedia Grounding [129.83134296316493]
We train a script knowledge model capable of both generating explicit graph scripts for learnt tasks and predicting future steps given a partial step sequence.
Human evaluation shows our model outperforming the WikiHow linear baseline, with 48.76% absolute gains in capturing sequential and non-sequential step relationships.
arXiv Detail & Related papers (2023-05-27T18:13:17Z)
- Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations [22.723309913388196]
We learn video representation that encodes both action steps and their temporal ordering, based on a large-scale dataset of web instructional videos and their narrations.
Our method jointly learns a video representation to encode individual step concepts, and a deep probabilistic model to capture both temporal dependencies and immense individual variations in the step ordering.
arXiv Detail & Related papers (2023-03-31T07:02:26Z)
- Learning and Verification of Task Structure in Instructional Videos [85.511888642497]
We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos.
Compared to prior work which learns step representations locally, our approach involves learning them globally.
We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order.
arXiv Detail & Related papers (2023-03-23T17:59:54Z)
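For the mistake-detection benchmarks mentioned above, a toy sketch of one of the two checks, verifying step order (this is an assumption about the setup, not VideoTaskformer's actual benchmark code):

```python
# Toy order check: recognized steps should appear as a subsequence of the
# reference ordering for the task; otherwise the execution order is suspect.
def steps_in_order(recognized, reference):
    it = iter(reference)
    return all(step in it for step in recognized)

reference = ["boil water", "add pasta", "drain", "add sauce"]
print(steps_in_order(["boil water", "drain"], reference))               # True
print(steps_in_order(["drain", "boil water", "add sauce"], reference))  # False
```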
- Graph2Vid: Flow graph to Video Grounding for Weakly-supervised Multi-Step Localization [14.95378874133603]
We consider the problem of weakly-supervised multi-step localization in instructional videos.
An established approach to this problem is to rely on a given list of steps.
We propose a new algorithm - Graph2Vid - that infers the actual ordering of steps in the video and simultaneously localizes them.
arXiv Detail & Related papers (2022-10-10T20:02:58Z)
- Learning To Recognize Procedural Activities with Distant Supervision [96.58436002052466]
We consider the problem of classifying fine-grained, multi-step activities from long videos spanning up to several minutes.
Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base.
arXiv Detail & Related papers (2022-01-26T15:06:28Z)
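The distant-supervision entry above matches transcribed narration to step descriptions with a language model; below is a minimal sketch of that general idea using off-the-shelf sentence embeddings (the model choice and example strings are assumptions, not the paper's setup):

```python
# Hedged sketch: embed ASR narration and knowledge-base step descriptions,
# then assign each narration snippet to its nearest step by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

kb_steps = ["pour milk into the cup", "grind the coffee beans", "steam the milk"]
asr_snippets = ["now I'm just gonna grind up some beans real quick"]

step_emb = model.encode(kb_steps, convert_to_tensor=True)
asr_emb = model.encode(asr_snippets, convert_to_tensor=True)

scores = util.cos_sim(asr_emb, step_emb)   # shape: (num_snippets, num_steps)
for snippet, row in zip(asr_snippets, scores):
    print(snippet, "->", kb_steps[int(row.argmax())])
```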
- A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos [126.66212285239624]
We propose a benchmark of structured procedural knowledge extracted from cooking videos.
Our manually annotated open-vocabulary resource includes 356 instructional cooking videos and 15,523 video clip/sentence-level annotations.
arXiv Detail & Related papers (2020-05-02T05:15:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.