StepFormer: Self-supervised Step Discovery and Localization in
Instructional Videos
- URL: http://arxiv.org/abs/2304.13265v1
- Date: Wed, 26 Apr 2023 03:37:28 GMT
- Title: StepFormer: Self-supervised Step Discovery and Localization in
Instructional Videos
- Authors: Nikita Dvornik, Isma Hadji, Ran Zhang, Konstantinos G. Derpanis,
Animesh Garg, Richard P. Wildes, Allan D. Jepson
- Abstract summary: We introduce StepFormer, a self-supervised model that discovers and localizes instruction steps in a video.
We train our system on a large dataset of instructional videos, using their automatically-generated subtitles as the only source of supervision.
Our model outperforms all previous unsupervised and weakly-supervised approaches on step detection and localization.
- Score: 47.03252542488226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instructional videos are an important resource to learn procedural tasks from
human demonstrations. However, the instruction steps in such videos are
typically short and sparse, with most of the video being irrelevant to the
procedure. This motivates the need to temporally localize the instruction steps
in such videos, i.e., the task of key-step localization. Traditional methods
for key-step localization require video-level human annotations and thus do not
scale to large datasets. In this work, we tackle the problem with no human
supervision and introduce StepFormer, a self-supervised model that discovers
and localizes instruction steps in a video. StepFormer is a transformer decoder
that attends to the video with learnable queries, and produces a sequence of
slots capturing the key-steps in the video. We train our system on a large
dataset of instructional videos, using their automatically-generated subtitles
as the only source of supervision. In particular, we supervise our system with
a sequence of text narrations using an order-aware loss function that filters
out irrelevant phrases. We show that our model outperforms all previous
unsupervised and weakly-supervised approaches on step detection and
localization by a large margin on three challenging benchmarks. Moreover, our
model demonstrates an emergent ability to perform zero-shot multi-step
localization and outperforms all relevant baselines on this task.
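
The abstract describes the method only at a high level. Below is a minimal, hedged sketch (not the authors' released implementation) of the two ideas it names: a transformer decoder whose learnable queries cross-attend to video features and return one slot per candidate key-step, and an order-preserving alignment between those slots and narration embeddings that can drop unmatched items. All module names, dimensions, and the simplified dynamic-programming objective are assumptions made for this illustration.

```python
# Illustrative sketch of a StepFormer-style step decoder and an order-aware
# alignment loss; sizes, names, and the DP objective are assumptions, not the
# paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StepDecoderSketch(nn.Module):
    """Transformer decoder with learnable step queries (illustrative only)."""

    def __init__(self, feat_dim=512, num_queries=32, num_layers=4, num_heads=8):
        super().__init__()
        # Learnable queries: each one can bind to a candidate key-step in the video.
        self.step_queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, video_feats):
        # video_feats: (B, T, D) pre-extracted per-segment video features.
        B = video_feats.size(0)
        queries = self.step_queries.unsqueeze(0).expand(B, -1, -1)  # (B, K, D)
        # Queries cross-attend to the video; the output is one slot per query.
        return self.decoder(tgt=queries, memory=video_feats)        # (B, K, D)


def order_aware_alignment_loss(step_slots, narration_embs):
    """Toy order-preserving alignment between step slots and narration embeddings.

    The paper's real objective is more involved (it explicitly filters out
    irrelevant phrases); this DP only keeps matches monotonic in order and
    allows slots or narrations to be skipped.
    """
    sim = F.normalize(step_slots, dim=-1) @ F.normalize(narration_embs, dim=-1).transpose(1, 2)
    B, K, N = sim.shape
    prev = [sim.new_zeros(B) for _ in range(N + 1)]          # row for zero slots consumed
    for i in range(1, K + 1):
        curr = [sim.new_zeros(B)]                            # column 0: all slots skipped so far
        for j in range(1, N + 1):
            skip = torch.maximum(prev[j], curr[j - 1])       # drop a slot or a narration
            match = prev[j - 1] + sim[:, i - 1, j - 1]       # match slot i with narration j
            curr.append(torch.maximum(skip, match))
        prev = curr
    return -prev[N].mean()  # maximize the best monotonic alignment score


if __name__ == "__main__":
    model = StepDecoderSketch()
    video = torch.randn(2, 200, 512)       # 2 videos, 200 segment features each
    narrations = torch.randn(2, 12, 512)   # 12 subtitle-sentence embeddings each
    slots = model(video)
    loss = order_aware_alignment_loss(slots, narrations)
    print(slots.shape, float(loss))
```

In the actual system the narration embeddings come from the videos' automatically generated subtitles, which serve as the only supervision; the sketch above merely gestures at how ordered slots can be trained against an ordered text sequence.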
Related papers
- Video-Mined Task Graphs for Keystep Recognition in Instructional Videos [71.16703750980143]
Procedural activity understanding requires perceiving human actions in terms of a broader task.
We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps.
We show the impact: more reliable zero-shot keystep localization and improved video representation learning.
arXiv Detail & Related papers (2023-07-17T18:19:36Z)
- Learning to Ground Instructional Articles in Videos through Narrations [50.3463147014498]
We present an approach for localizing steps of procedural activities in narrated how-to videos.
We source the step descriptions from a language knowledge base (wikiHow) containing instructional articles.
Our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities.
arXiv Detail & Related papers (2023-06-06T15:45:53Z)
- Learning and Verification of Task Structure in Instructional Videos [85.511888642497]
We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos.
Compared to prior work which learns step representations locally, our approach involves learning them globally.
We introduce two new benchmarks for detecting mistakes in instructional videos: verifying whether a step is anomalous and whether steps are executed in the right order.
arXiv Detail & Related papers (2023-03-23T17:59:54Z)
- Learning To Recognize Procedural Activities with Distant Supervision [96.58436002052466]
We consider the problem of classifying fine-grained, multi-step activities from long videos spanning up to several minutes.
Our method uses a language model to match noisy, automatically transcribed speech from the video to step descriptions in the knowledge base (a toy sketch of this matching idea appears after this list).
arXiv Detail & Related papers (2022-01-26T15:06:28Z)
- Unsupervised Discovery of Actions in Instructional Videos [86.77350242461803]
We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos.
We propose a sequential autoregressive model for temporal segmentation of videos, which learns to represent and discover the sequential relationship between different atomic actions of the task.
Our approach outperforms the state-of-the-art unsupervised methods with large margins.
arXiv Detail & Related papers (2021-06-28T14:05:01Z)
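
As a companion to the distant-supervision entry above, here is a toy sketch of matching noisy ASR sentences to knowledge-base step descriptions with an off-the-shelf sentence encoder. It is not the cited paper's actual pipeline; the encoder choice, example sentences, and similarity threshold are all assumptions.

```python
# Toy illustration of matching automatically transcribed speech to step
# descriptions via sentence embeddings; encoder name and threshold are assumed.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical encoder choice

asr_sentences = [
    "okay so first we're gonna whisk the eggs really well",
    "thanks for watching and don't forget to subscribe",
]
kb_steps = ["Whisk the eggs", "Heat butter in a pan", "Pour the eggs into the pan"]

asr_emb = encoder.encode(asr_sentences, convert_to_tensor=True, normalize_embeddings=True)
step_emb = encoder.encode(kb_steps, convert_to_tensor=True, normalize_embeddings=True)

similarity = util.cos_sim(asr_emb, step_emb)  # (num_sentences, num_steps)
for i, sentence in enumerate(asr_sentences):
    score, j = similarity[i].max(dim=0)
    if score > 0.5:  # assumed threshold; low-similarity chit-chat is dropped
        print(f"{sentence!r} -> {kb_steps[int(j)]!r} (sim={float(score):.2f})")
```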