Unsupervised Discovery of Actions in Instructional Videos
- URL: http://arxiv.org/abs/2106.14733v1
- Date: Mon, 28 Jun 2021 14:05:01 GMT
- Title: Unsupervised Discovery of Actions in Instructional Videos
- Authors: AJ Piergiovanni and Anelia Angelova and Michael S. Ryoo and Irfan Essa
- Abstract summary: We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos.
We propose a sequential autoregressive model for temporal segmentation of videos, which learns to represent and discover the sequential relationship between different atomic actions of the task.
Our approach outperforms the state-of-the-art unsupervised methods by large margins.
- Score: 86.77350242461803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos. Instructional videos contain complex activities and are a rich source of information for intelligent agents, such as autonomous robots or virtual assistants, which can, for example, automatically 'read' the steps from an instructional video and execute them. However, videos are rarely annotated with atomic activities, their boundaries, or duration. We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos. We propose a sequential stochastic autoregressive model for temporal segmentation of videos, which learns to represent and discover the sequential relationship between different atomic actions of the task, and which provides automatic and unsupervised self-labeling for videos. Our approach outperforms the state-of-the-art unsupervised methods by large margins. We will open source the code.
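To make the abstract's idea of a "sequential stochastic autoregressive model" more concrete, the sketch below shows what a generic model of this kind could look like in PyTorch: a video is summarized from per-frame features, and atomic-action labels for its segments are then sampled one at a time, each conditioned on the previous one. All module names, dimensions, and the overall architecture here are illustrative assumptions, not the authors' actual model or training procedure.

```python
# Minimal sketch of a generic stochastic autoregressive action-sequence
# labeler. This is NOT the paper's architecture; it only illustrates the
# idea of sampling a sequence of atomic-action labels for a video.
import torch
import torch.nn as nn


class ActionSequenceModel(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, num_actions=10):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.action_embed = nn.Embedding(num_actions, hidden_dim)
        self.decoder = nn.GRUCell(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, frame_feats, num_segments):
        # frame_feats: (batch, time, feat_dim) per-frame features
        _, h = self.encoder(frame_feats)      # summarize the whole video
        h = h.squeeze(0)                      # (batch, hidden_dim)
        prev = torch.zeros_like(h)            # "start" embedding
        actions = []
        for _ in range(num_segments):
            h = self.decoder(prev, h)
            logits = self.head(h)
            # Stochastic step: sample the next atomic action instead of argmax.
            a = torch.distributions.Categorical(logits=logits).sample()
            actions.append(a)
            prev = self.action_embed(a)
        return torch.stack(actions, dim=1)    # (batch, num_segments)


# Toy usage: pseudo-label two videos with 5 segments each, from random
# stand-in frame features (in practice these would come from a CNN backbone).
model = ActionSequenceModel()
feats = torch.randn(2, 100, 512)
print(model(feats, num_segments=5))
```

Mapping the sampled segment labels back to frames (i.e., the self-labeling step) and the losses that make the discovered actions meaningful are the substance of the paper and are not shown here.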
Related papers
- Learning to Act from Actionless Videos through Dense Correspondences [87.1243107115642]
We present an approach to construct a video-based robot policy capable of reliably executing diverse tasks across different robots and environments.
Our method leverages images as a task-agnostic representation, encoding both the state and action information, and text as a general representation for specifying robot goals.
We demonstrate the efficacy of our approach in learning policies on table-top manipulation and navigation tasks.
arXiv Detail & Related papers (2023-10-12T17:59:23Z)
- StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos [47.03252542488226]
We introduce StepFormer, a self-supervised model that discovers and localizes instruction steps in a video.
We train our system on a large dataset of instructional videos, using their automatically-generated subtitles as the only source of supervision.
Our model outperforms all previous unsupervised and weakly-supervised approaches on step detection and localization.
arXiv Detail & Related papers (2023-04-26T03:37:28Z)
- Learning To Recognize Procedural Activities with Distant Supervision [96.58436002052466]
We consider the problem of classifying fine-grained, multi-step activities from long videos spanning up to several minutes.
Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base.
arXiv Detail & Related papers (2022-01-26T15:06:28Z)
- Unsupervised Action Segmentation for Instructional Videos [86.77350242461803]
We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos.
The approach learns to represent and discover the sequential relationship between different atomic actions of the task, and provides automatic and unsupervised self-labeling.
arXiv Detail & Related papers (2021-06-07T16:02:06Z)
- Learning Object Manipulation Skills via Approximate State Estimation from Real Videos [47.958512470724926]
Humans are adept at learning new tasks by watching a few instructional videos.
On the other hand, robots that learn new actions either require a lot of effort through trial and error, or use expert demonstrations that are challenging to obtain.
In this paper, we explore a method that facilitates learning object manipulation skills directly from videos.
arXiv Detail & Related papers (2020-11-13T08:53:47Z)
- Exploring Relations in Untrimmed Videos for Self-Supervised Learning [17.670226952829506]
Existing self-supervised learning methods mainly rely on trimmed videos for model training.
We propose a novel self-supervised method, referred to as Exploring Relations in Untrimmed Videos (ERUV).
ERUV learns richer representations and outperforms state-of-the-art self-supervised methods by significant margins.
arXiv Detail & Related papers (2020-08-06T15:29:25Z)
- A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos [126.66212285239624]
We propose a benchmark of structured procedural knowledge extracted from cooking videos.
Our manually annotated open-vocabulary resource includes 356 instructional cooking videos and 15,523 video clip/sentence-level annotations.
arXiv Detail & Related papers (2020-05-02T05:15:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.