STEPs: Self-Supervised Key Step Extraction and Localization from
Unlabeled Procedural Videos
- URL: http://arxiv.org/abs/2301.00794v3
- Date: Sat, 9 Sep 2023 15:08:02 GMT
- Title: STEPs: Self-Supervised Key Step Extraction and Localization from
Unlabeled Procedural Videos
- Authors: Anshul Shah, Benjamin Lundell, Harpreet Sawhney, Rama Chellappa
- Abstract summary: We decompose the problem into two steps: representation learning and key steps extraction.
We propose a training objective, Bootstrapped Multi-Cue Contrastive (BMC2) loss to learn discriminative representations for various steps without any labels.
We show significant improvements over prior works on the tasks of key step localization and phase classification.
- Score: 40.82053186029603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address the problem of extracting key steps from unlabeled procedural
videos, motivated by the potential of Augmented Reality (AR) headsets to
revolutionize job training and performance. We decompose the problem into two
steps: representation learning and key step extraction. We propose a training
objective, the Bootstrapped Multi-Cue Contrastive (BMC2) loss, to learn
discriminative representations for the various steps without any labels. Unlike
prior works, we develop techniques to train a lightweight temporal module that
uses off-the-shelf features for self-supervision. Our approach can seamlessly
leverage information from multiple cues such as optical flow, depth, or gaze to
learn discriminative features for key steps, making it amenable to AR
applications. Finally, we extract key steps via a tunable algorithm that
clusters the representations and samples key steps from the clusters. We show
significant improvements over prior works on the tasks of key step localization
and phase classification. Qualitative results demonstrate that the extracted
key steps are meaningful and succinctly represent the various steps of the
procedural tasks.
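As a rough illustration of the two-step recipe in the abstract, the sketch below (not the authors' code; all module names, feature dimensions, and hyperparameters are assumptions) trains a lightweight temporal head per cue on precomputed off-the-shelf features with a plain cross-cue InfoNCE loss standing in for the BMC2 objective, then extracts candidate key steps by clustering the learned embeddings and sampling the frame nearest each cluster centre.

    # Minimal sketch of multi-cue contrastive training and cluster-and-sample
    # key step extraction. Shapes and hyperparameters are illustrative only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from sklearn.cluster import KMeans

    class TemporalHead(nn.Module):
        """Lightweight temporal module over precomputed (off-the-shelf) features."""
        def __init__(self, in_dim=2048, dim=128):
            super().__init__()
            self.temporal = nn.GRU(in_dim, dim, batch_first=True)
            self.proj = nn.Linear(dim, dim)

        def forward(self, feats):                  # feats: (B, T, in_dim)
            h, _ = self.temporal(feats)            # (B, T, dim)
            return F.normalize(self.proj(h), dim=-1)

    def cross_cue_contrastive_loss(z_a, z_b, temperature=0.1):
        """Symmetric InfoNCE across two cues: frame t of cue A should match
        frame t of cue B; every other frame acts as a negative."""
        B, T, D = z_a.shape
        za, zb = z_a.reshape(B * T, D), z_b.reshape(B * T, D)
        logits = za @ zb.t() / temperature         # cross-cue similarity matrix
        targets = torch.arange(B * T, device=za.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def extract_key_steps(embeddings, num_steps=5):
        """Cluster frame embeddings, then sample the frame nearest each centroid."""
        km = KMeans(n_clusters=num_steps, n_init=10).fit(embeddings)
        key_frames = []
        for centre in km.cluster_centers_:
            dists = ((embeddings - centre) ** 2).sum(axis=1)
            key_frames.append(int(dists.argmin()))
        return sorted(key_frames)                  # candidate key step frame indices

    # Toy usage with random stand-ins for off-the-shelf RGB and optical flow features.
    rgb_feats = torch.randn(4, 64, 2048)           # (videos, frames, feature dim)
    flow_feats = torch.randn(4, 64, 2048)
    head_rgb, head_flow = TemporalHead(), TemporalHead()
    loss = cross_cue_contrastive_loss(head_rgb(rgb_feats), head_flow(flow_feats))
    loss.backward()

    with torch.no_grad():
        emb = head_rgb(rgb_feats)[0].numpy()       # embeddings for one video
    print(loss.item(), extract_key_steps(emb))

Because the backbone features are precomputed and frozen, only the small temporal head is trained, which is what keeps the approach lightweight enough for AR settings; the bootstrapping of positives and negatives in the actual BMC2 loss is not reproduced in this sketch.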
Related papers
- Reinforcement Learning with Action Sequence for Data-Efficient Robot Learning [62.3886343725955]
We introduce a novel RL algorithm that learns a critic network that outputs Q-values over a sequence of actions.
By explicitly training the value functions to learn the consequence of executing a series of current and future actions, our algorithm allows for learning useful value functions from noisy trajectories.
arXiv Detail & Related papers (2024-11-19T01:23:52Z)
- PRISE: LLM-Style Sequence Compression for Learning Temporal Action Abstractions in Control [55.81022882408587]
Temporal action abstractions, along with belief state representations, are a powerful knowledge sharing mechanism for sequential decision making.
We propose a novel view that treats inducing temporal action abstractions as a sequence compression problem.
We introduce an approach that combines continuous action quantization with byte pair encoding to learn powerful action abstractions.
arXiv Detail & Related papers (2024-02-16T04:55:09Z)
- Video-Mined Task Graphs for Keystep Recognition in Instructional Videos [71.16703750980143]
Procedural activity understanding requires perceiving human actions in terms of a broader task.
We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps.
We show the impact: more reliable zero-shot keystep localization and improved video representation learning.
arXiv Detail & Related papers (2023-07-17T18:19:36Z)
- SeMAIL: Eliminating Distractors in Visual Imitation via Separated Models [22.472167814814448]
We propose a new model-based imitation learning algorithm named Separated Model-based Adversarial Imitation Learning (SeMAIL).
Our method achieves near-expert performance on various visual control tasks with complex observations, as well as on the more challenging tasks whose backgrounds differ from the expert observations.
arXiv Detail & Related papers (2023-06-19T04:33:44Z)
- Iterative Forward Tuning Boosts In-Context Learning in Language Models [88.25013390669845]
In this study, we introduce a novel two-stage framework to boost in-context learning in large language models (LLMs).
Specifically, our framework delineates the ICL process into two distinct stages: Deep-Thinking and test stages.
The Deep-Thinking stage incorporates a unique attention mechanism, i.e., iterative enhanced attention, which enables multiple rounds of information accumulation.
arXiv Detail & Related papers (2023-05-22T13:18:17Z)
- Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations [22.723309913388196]
We learn video representation that encodes both action steps and their temporal ordering, based on a large-scale dataset of web instructional videos and their narrations.
Our method jointly learns a video representation to encode individual step concepts, and a deep probabilistic model to capture both temporal dependencies and immense individual variations in the step ordering.
arXiv Detail & Related papers (2023-03-31T07:02:26Z)
- Automatic Discovery of Multi-perspective Process Model using Reinforcement Learning [7.5989847759545155]
We propose an automatic discovery framework for multi-perspective process models based on deep Q-Learning.
Our Dual Experience Replay with Experience Distribution (DERED) approach can automatically perform process model discovery steps, conformance check steps, and enhancement steps.
We validate our approach using six real-world event datasets collected in port logistics, steel manufacturing, finance, IT, and government administration.
arXiv Detail & Related papers (2022-11-30T02:18:29Z)
- Learning To Recognize Procedural Activities with Distant Supervision [96.58436002052466]
We consider the problem of classifying fine-grained, multi-step activities from long videos spanning up to several minutes.
Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base.
arXiv Detail & Related papers (2022-01-26T15:06:28Z)
- Self-supervised Knowledge Distillation for Few-shot Learning [123.10294801296926]
Few-shot learning is a promising learning paradigm due to its ability to learn out-of-order distributions quickly with only a few samples.
We propose a simple approach to improve the representation capacity of deep neural networks for few-shot learning tasks.
Our experiments show that, even in the first stage, self-supervision can outperform current state-of-the-art methods.
arXiv Detail & Related papers (2020-06-17T11:27:00Z)