Unsupervised Action Segmentation for Instructional Videos
- URL: http://arxiv.org/abs/2106.03738v1
- Date: Mon, 7 Jun 2021 16:02:06 GMT
- Title: Unsupervised Action Segmentation for Instructional Videos
- Authors: AJ Piergiovanni and Anelia Angelova and Michael S. Ryoo and Irfan Essa
- Abstract summary: We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos.
The model learns to represent and discover the sequential relationships between the different atomic actions of the task, which provides automatic and unsupervised self-labeling.
- Score: 86.77350242461803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we address the problem of automatically discovering
atomic actions in an unsupervised manner from instructional videos, which are
rarely annotated with atomic actions. We present an unsupervised approach to
learn atomic actions of structured human tasks from a variety of instructional
videos, based on a sequential stochastic autoregressive model for temporal
segmentation of videos. This model learns to represent and discover the
sequential relationships between the different atomic actions of the task,
which provides automatic and unsupervised self-labeling.
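As a rough, hypothetical illustration of the kind of model the abstract describes, the sketch below shows a sequential stochastic autoregressive segmenter over precomputed per-frame features: at each frame it samples an atomic-action label conditioned on the previously sampled label, so contiguous runs of the same label form a temporal segmentation. All names and design choices here (AutoregressiveSegmenter, the GRU backbone, the feature dimensions) are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a sequential stochastic autoregressive segmenter.
# Assumes precomputed per-frame features; names do not come from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoregressiveSegmenter(nn.Module):
    """Samples a per-frame atomic-action label conditioned on the previous one."""
    def __init__(self, feat_dim=512, num_actions=10, hidden=256):
        super().__init__()
        self.num_actions = num_actions
        self.rnn = nn.GRU(feat_dim + num_actions, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, num_actions)

    def forward(self, frames):
        # frames: (B, T, feat_dim) precomputed per-frame features.
        B, T, _ = frames.shape
        prev = frames.new_zeros(B, self.num_actions)  # no previous action at t=0
        h, all_logits, labels = None, [], []
        for t in range(T):
            x = torch.cat([frames[:, t], prev], dim=-1).unsqueeze(1)
            out, h = self.rnn(x, h)
            logits = self.action_head(out[:, 0])
            # Stochastic step: sample the current atomic-action label; it
            # conditions the prediction at the next frame (autoregression).
            a_t = torch.distributions.Categorical(logits=logits).sample()
            prev = F.one_hot(a_t, self.num_actions).float()
            all_logits.append(logits)
            labels.append(a_t)
        # Contiguous runs of the same label in `labels` are the segments.
        return torch.stack(all_logits, 1), torch.stack(labels, 1)

# Usage on dummy features: 2 videos, 100 frames, 512-dim features.
model = AutoregressiveSegmenter()
logits, labels = model(torch.randn(2, 100, 512))  # labels: (2, 100)
```

In a full system, an unsupervised objective (for example, a reconstruction or consistency loss over the self-labeled segments) would train such a model; that part is omitted from this sketch.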
Related papers
- Temporal Divide-and-Conquer Anomaly Actions Localization in Semi-Supervised Videos with Hierarchical Transformer [0.9208007322096532]
Anomaly action detection and localization play an essential role in security and advanced surveillance systems.
We propose a hierarchical transformer model designed to evaluate the significance of observed actions in anomalous videos.
Our approach segments a parent video hierarchically into multiple temporal child instances and measures the influence of the child nodes on classifying the abnormality of the parent video.
arXiv Detail & Related papers (2024-08-24T18:12:58Z)
- StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos [47.03252542488226]
We introduce StepFormer, a self-supervised model that discovers and localizes instruction steps in a video.
We train our system on a large dataset of instructional videos, using their automatically-generated subtitles as the only source of supervision.
Our model outperforms all previous unsupervised and weakly-supervised approaches on step detection and localization.
arXiv Detail & Related papers (2023-04-26T03:37:28Z)
- Weakly-Supervised Temporal Action Detection for Fine-Grained Videos with Hierarchical Atomic Actions [13.665489987620724]
We tackle the problem of weakly-supervised fine-grained temporal action detection in videos for the first time.
We propose to model actions as the combinations of reusable atomic actions which are automatically discovered from data.
Our approach constructs a visual representation hierarchy of four levels: clip level, atomic action level, fine action class level and coarse action class level, with supervision at each level.
arXiv Detail & Related papers (2022-07-24T20:32:24Z)
- Learning To Recognize Procedural Activities with Distant Supervision [96.58436002052466]
We consider the problem of classifying fine-grained, multi-step activities from long videos spanning up to several minutes.
Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in a textual knowledge base.
arXiv Detail & Related papers (2022-01-26T15:06:28Z)
- Learning to Align Sequential Actions in the Wild [123.62879270881807]
We propose an approach to align sequential actions in the wild that involve diverse temporal variations.
Our model accounts for both monotonic and non-monotonic sequences.
We demonstrate that our approach consistently outperforms the state-of-the-art in self-supervised sequential action representation learning.
arXiv Detail & Related papers (2021-11-17T18:55:36Z)
- Unsupervised Discovery of Actions in Instructional Videos [86.77350242461803]
We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos.
We propose a sequential autoregressive model for temporal segmentation of videos, which learns to represent and discover the sequential relationship between different atomic actions of the task.
Our approach outperforms state-of-the-art unsupervised methods by large margins.
arXiv Detail & Related papers (2021-06-28T14:05:01Z)
- Learning to Segment Actions from Observation and Narration [56.99443314542545]
We apply a generative segmental model of task structure, guided by narration, to action segmentation in video.
We focus on unsupervised and weakly-supervised settings where no action labels are known during training.
arXiv Detail & Related papers (2020-05-07T18:03:57Z)
- A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos [126.66212285239624]
We propose a benchmark of structured procedural knowledge extracted from cooking videos.
Our manually annotated open-vocabulary resource includes 356 instructional cooking videos and 15,523 video clip/sentence-level annotations.
arXiv Detail & Related papers (2020-05-02T05:15:20Z)