Learning To Recognize Procedural Activities with Distant Supervision
- URL: http://arxiv.org/abs/2201.10990v1
- Date: Wed, 26 Jan 2022 15:06:28 GMT
- Title: Learning To Recognize Procedural Activities with Distant Supervision
- Authors: Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani
- Abstract summary: We consider the problem of classifying fine-grained, multi-step activities from long videos spanning up to several minutes.
Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base.
- Score: 96.58436002052466
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we consider the problem of classifying fine-grained, multi-step
activities (e.g., cooking different recipes, making disparate home
improvements, creating various forms of arts and crafts) from long videos
spanning up to several minutes. Accurately categorizing these activities
requires not only recognizing the individual steps that compose the task but
also capturing their temporal dependencies. This problem is dramatically
different from traditional action classification, where models are typically
optimized on videos that span only a few seconds and that are manually trimmed
to contain simple atomic actions. While step annotations could enable the
training of models to recognize the individual steps of procedural activities,
existing large-scale datasets in this area do not include such segment labels
due to the prohibitive cost of manually annotating temporal boundaries in long
videos. To address this issue, we propose to automatically identify steps in
instructional videos by leveraging the distant supervision of a textual
knowledge base (wikiHow) that includes detailed descriptions of the steps
needed for the execution of a wide variety of complex activities. Our method
uses a language model to match noisy, automatically-transcribed speech from the
video to step descriptions in the knowledge base. We demonstrate that video
models trained to recognize these automatically-labeled steps (without manual
supervision) yield a representation that achieves superior generalization
performance on four downstream tasks: recognition of procedural activities,
step classification, step forecasting and egocentric video classification.
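As a rough, non-authoritative illustration of this distant-supervision idea (not the authors' actual implementation), the sketch below embeds transcribed narration segments and wikiHow step descriptions with a generic sentence-embedding model and assigns each segment its most similar step as a pseudo-label; the model choice, function name, and similarity threshold are assumptions made for illustration.

    # Minimal sketch of distant supervision via text matching (illustrative only;
    # the embedding model and helper names are assumptions, not the paper's code).
    from sentence_transformers import SentenceTransformer, util

    # A generic sentence-embedding model stands in for the paper's language model.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def pseudo_label_steps(asr_segments, step_descriptions, min_sim=0.3):
        """Assign each transcribed narration segment the most similar wikiHow step.

        asr_segments: list of strings produced by automatic speech recognition.
        step_descriptions: list of step strings drawn from wikiHow articles.
        Returns a list of (segment_index, step_index, similarity) pseudo-labels.
        """
        seg_emb = encoder.encode(asr_segments, convert_to_tensor=True, normalize_embeddings=True)
        step_emb = encoder.encode(step_descriptions, convert_to_tensor=True, normalize_embeddings=True)
        sims = util.cos_sim(seg_emb, step_emb)  # shape: (num_segments, num_steps)
        labels = []
        for i in range(len(asr_segments)):
            j = int(sims[i].argmax())
            score = float(sims[i, j])
            if score >= min_sim:  # discard weak matches; the threshold is a free parameter
                labels.append((i, j, score))
        return labels

The resulting pseudo-labels can then supervise a standard video classifier on the corresponding clips, which is the kind of training signal the abstract describes.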
Related papers
- Video-Mined Task Graphs for Keystep Recognition in Instructional Videos [71.16703750980143]
Procedural activity understanding requires perceiving human actions in terms of a broader task.
We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps (a toy illustration of such a graph appears after this list).
We show the impact: more reliable zero-shot keystep localization and improved video representation learning.
arXiv Detail & Related papers (2023-07-17T18:19:36Z)
- Learning to Ground Instructional Articles in Videos through Narrations [50.3463147014498]
We present an approach for localizing steps of procedural activities in narrated how-to videos.
We source the step descriptions from a language knowledge base (wikiHow) containing instructional articles.
Our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities.
arXiv Detail & Related papers (2023-06-06T15:45:53Z)
- StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos [47.03252542488226]
We introduce StepFormer, a self-supervised model that discovers and localizes instruction steps in a video.
We train our system on a large dataset of instructional videos, using their automatically-generated subtitles as the only source of supervision.
Our model outperforms all previous unsupervised and weakly-supervised approaches on step detection and localization.
arXiv Detail & Related papers (2023-04-26T03:37:28Z)
- Procedure-Aware Pretraining for Instructional Video Understanding [58.214549181779006]
A key challenge in procedure understanding is extracting procedural knowledge from unlabeled videos.
Our main insight is that instructional videos depict sequences of steps that repeat between instances of the same or different tasks.
From these recurring sequences a graph of steps can be mined and used to generate pseudo labels to train a video representation that encodes the procedural knowledge in a more accessible form.
arXiv Detail & Related papers (2023-03-31T17:41:31Z)
- Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations [22.723309913388196]
We learn video representation that encodes both action steps and their temporal ordering, based on a large-scale dataset of web instructional videos and their narrations.
Our method jointly learns a video representation to encode individual step concepts, and a deep probabilistic model to capture both temporal dependencies and immense individual variations in the step ordering.
arXiv Detail & Related papers (2023-03-31T07:02:26Z)
- Learning and Verification of Task Structure in Instructional Videos [85.511888642497]
We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos.
Compared to prior work, which learns step representations locally, our approach learns them globally.
We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order.
arXiv Detail & Related papers (2023-03-23T17:59:54Z)
- A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos [126.66212285239624]
We propose a benchmark of structured procedural knowledge extracted from cooking videos.
Our manually annotated open-vocabulary resource includes 356 instructional cooking videos and 15,523 video clip/sentence-level annotations.
arXiv Detail & Related papers (2020-05-02T05:15:20Z)
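As referenced in the Video-Mined Task Graphs entry above, the following is a toy, hypothetical sketch of mining a probabilistic task graph from keystep sequences: it simply counts keystep transitions observed across videos and normalizes them into transition probabilities. It illustrates the general idea only and is not the method of any paper listed here.

    # Illustrative only: a toy transition-probability "task graph" built from
    # keystep sequences; not the actual method of the papers listed above.
    from collections import defaultdict

    def mine_task_graph(keystep_sequences):
        """Count keystep transitions across videos and normalize into probabilities.

        keystep_sequences: list of lists of keystep ids, one list per video.
        Returns {step: {next_step: probability}}.
        """
        counts = defaultdict(lambda: defaultdict(int))
        for seq in keystep_sequences:
            for a, b in zip(seq, seq[1:]):
                counts[a][b] += 1
        graph = {}
        for a, nexts in counts.items():
            total = sum(nexts.values())
            graph[a] = {b: c / total for b, c in nexts.items()}
        return graph

    # Example: three videos of the same task with slightly different step orders.
    videos = [["crack eggs", "whisk", "fry"],
              ["crack eggs", "whisk", "season", "fry"],
              ["crack eggs", "fry"]]
    print(mine_task_graph(videos)["crack eggs"])  # {'whisk': 0.666..., 'fry': 0.333...}

Such a graph can support zero-shot keystep localization or pseudo-labeling, in the spirit of the graph-based entries above.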
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.