Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations
- URL: http://arxiv.org/abs/2303.17839v1
- Date: Fri, 31 Mar 2023 07:02:26 GMT
- Title: Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations
- Authors: Yiwu Zhong, Licheng Yu, Yang Bai, Shangwen Li, Xueting Yan, Yin Li
- Abstract summary: We learn a video representation that encodes both action steps and their temporal ordering, based on a large-scale dataset of web instructional videos and their narrations.
Our method jointly learns a video representation to encode individual step concepts, and a deep probabilistic model to capture both temporal dependencies and immense individual variations in the step ordering.
- Score: 22.723309913388196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The abundance of instructional videos and their narrations over the Internet
offers an exciting avenue for understanding procedural activities. In this
work, we propose to learn a video representation that encodes both action steps
and their temporal ordering, based on a large-scale dataset of web
instructional videos and their narrations, without using human annotations. Our
method jointly learns a video representation to encode individual step
concepts, and a deep probabilistic model to capture both temporal dependencies
and immense individual variations in the step ordering. We empirically
demonstrate that learning temporal ordering not only enables new capabilities
for procedure reasoning, but also reinforces the recognition of individual
steps. Our model significantly advances the state-of-the-art results on step
classification (+2.8% / +3.3% on COIN / EPIC-Kitchens) and step forecasting
(+7.4% on COIN). Moreover, our model attains promising results in zero-shot
inference for step classification and forecasting, as well as in predicting
diverse and plausible steps for incomplete procedures. Our code is available at
https://github.com/facebookresearch/ProcedureVRL.
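To make the recipe above concrete, here is a minimal PyTorch sketch of the two training signals the abstract describes. This is not the authors' released code (see the repository above for that): the encoders are stand-in linear layers, and the paper's deep probabilistic ordering model is replaced by a simple causal-transformer regressor. All names and dimensions are hypothetical.

```python
# Hypothetical sketch, not the authors' implementation. Two losses:
# (1) match clips to narrated step concepts, (2) model step ordering.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProcedureAwareSketch(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, dim)  # stand-in for a video backbone
        self.text_proj = nn.Linear(text_dim, dim)    # stand-in for a language model
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.order_model = nn.TransformerEncoder(layer, num_layers=2)

    def matching_loss(self, clip_feats, step_text_feats, tau=0.07):
        """InfoNCE-style loss tying each clip to its narrated step concept."""
        v = F.normalize(self.video_proj(clip_feats), dim=-1)
        t = F.normalize(self.text_proj(step_text_feats), dim=-1)
        logits = v @ t.t() / tau
        target = torch.arange(v.size(0), device=v.device)
        return F.cross_entropy(logits, target)

    def ordering_loss(self, step_seq):
        """Regress each step embedding from its predecessors (causal mask);
        a crude stand-in for the paper's probabilistic ordering model."""
        T = step_seq.size(1)
        causal = torch.triu(torch.full((T - 1, T - 1), float("-inf")), diagonal=1)
        ctx = self.order_model(step_seq[:, :-1], mask=causal)
        pred = F.normalize(ctx, dim=-1)
        tgt = F.normalize(step_seq[:, 1:], dim=-1)
        return (1.0 - (pred * tgt).sum(-1)).mean()

model = ProcedureAwareSketch()
clips = torch.randn(8, 2048)    # one feature vector per sampled clip
texts = torch.randn(8, 768)     # embeddings of the matching step narrations
steps = torch.randn(4, 6, 512)  # per-video sequences of step embeddings
loss = model.matching_loss(clips, texts) + model.ordering_loss(steps)
```

The ordering term is what enables the forecasting and diverse-step-prediction capabilities reported above, since a model of step ordering can be queried for plausible continuations of an incomplete procedure.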
Related papers
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, and a visual encoder extracts features from each clip independently.
A decoder then aggregates the clip features in an online fashion for the final class prediction (a rough sketch follows).
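As a rough illustration of that online aggregation (hypothetical names; the actual model additionally learns per-class action prototypes), a GRU can play the role of the decoder, updating a running state clip by clip so that a prediction is available before the action completes:

```python
# Hypothetical sketch of online clip aggregation for early recognition.
import torch
import torch.nn as nn

class OnlineAggregator(nn.Module):
    def __init__(self, clip_dim=512, hidden=512, num_classes=100):
        super().__init__()
        self.rnn = nn.GRUCell(clip_dim, hidden)          # the "decoder" stand-in
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, clip_feats):
        """clip_feats: (T, clip_dim), one feature per short clip, in order."""
        h = torch.zeros(1, self.classifier.in_features)
        logits = []
        for feat in clip_feats:                  # online: one clip at a time
            h = self.rnn(feat.unsqueeze(0), h)
            logits.append(self.classifier(h))    # early prediction at each step
        return torch.stack(logits)               # (T, 1, num_classes)

agg = OnlineAggregator()
early_logits = agg(torch.randn(10, 512))  # usable at any prefix length
```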
arXiv Detail & Related papers (2023-12-11T18:31:13Z) - Video-Mined Task Graphs for Keystep Recognition in Instructional Videos [71.16703750980143]
Procedural activity understanding requires perceiving human actions in terms of a broader task.
We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps.
We show the impact: more reliable zero-shot keystep localization and improved video representation learning.
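The mined task graph can be viewed as keystep transition probabilities estimated across many videos. A toy sketch with hypothetical keystep sequences (the paper's mining pipeline is more involved):

```python
# Toy sketch: estimate a probabilistic task graph from mined keystep sequences.
# Each edge weight approximates P(next keystep | current keystep).
from collections import Counter, defaultdict

def build_task_graph(keystep_sequences):
    counts = defaultdict(Counter)
    for seq in keystep_sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}

graph = build_task_graph([
    ["crack eggs", "whisk", "heat pan", "pour batter"],
    ["crack eggs", "whisk", "pour batter"],
])
print(graph["whisk"])  # {'heat pan': 0.5, 'pour batter': 0.5}
```

Such priors can then regularize keystep localization, e.g. by down-weighting detections that imply improbable transitions.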
arXiv Detail & Related papers (2023-07-17T18:19:36Z) - Learning to Ground Instructional Articles in Videos through Narrations [50.3463147014498]
We present an approach for localizing steps of procedural activities in narrated how-to videos.
We source the step descriptions from a language knowledge base (wikiHow) containing instructional articles.
Our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities.
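A hedged sketch of the grounding itself: the full model is trained end to end across the three modalities, while the function below shows only the final matching step, with hypothetical inputs.

```python
# Hypothetical sketch: temporally ground step descriptions in a video by
# cosine similarity between step-text embeddings and per-frame embeddings.
import torch
import torch.nn.functional as F

def ground_steps(step_embs, frame_embs):
    """step_embs: (S, D), one embedding per wikiHow step;
    frame_embs: (T, D), one embedding per video frame.
    Returns the index of the best-matching frame for each step."""
    sim = F.normalize(step_embs, dim=-1) @ F.normalize(frame_embs, dim=-1).t()
    return sim.argmax(dim=1)  # (S,)

frames = torch.randn(300, 512)  # 300 frames of 512-d features
steps = torch.randn(7, 512)     # 7 article steps
print(ground_steps(steps, frames))
```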
arXiv Detail & Related papers (2023-06-06T15:45:53Z) - Procedure-Aware Pretraining for Instructional Video Understanding [58.214549181779006]
A key challenge in procedure understanding is extracting procedural knowledge from unlabeled videos.
Our main insight is that instructional videos depict sequences of steps that repeat between instances of the same or different tasks.
We mine these sequences into a procedural knowledge graph; the graph can then be used to generate pseudo labels to train a video representation that encodes the procedural knowledge in a more accessible form (see the sketch below).
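A toy sketch of the pseudo-labeling idea, under the assumption that graph nodes act as step prototypes (hypothetical names; the paper's graph construction and pre-training objectives differ in detail):

```python
# Hypothetical sketch: pseudo labels from a procedural knowledge graph.
# Each node is a step prototype; a clip's pseudo label is its nearest node,
# which can then supervise a standard classification loss.
import torch
import torch.nn.functional as F

def pseudo_labels(clip_embs, node_embs):
    """clip_embs: (N, D) clip features; node_embs: (K, D) node prototypes."""
    sim = F.normalize(clip_embs, dim=-1) @ F.normalize(node_embs, dim=-1).t()
    return sim.argmax(dim=1)  # (N,) pseudo step label per clip

nodes = torch.randn(50, 512)           # 50 step prototypes (graph nodes)
clips = torch.randn(16, 512)           # a batch of clip embeddings
targets = pseudo_labels(clips, nodes)  # targets for representation learning
```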
arXiv Detail & Related papers (2023-03-31T17:41:31Z) - Learning and Verification of Task Structure in Instructional Videos [85.511888642497]
We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos.
Whereas prior work learns step representations locally, our approach learns them globally, using context from the entire video (sketched below).
We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order.
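Learning step representations globally is reminiscent of masked modeling over the whole step sequence. A minimal sketch under that assumption (hypothetical module names, not the VideoTaskformer code):

```python
# Hypothetical sketch of masked step modeling: hide one step's features and
# predict its step label from the full surrounding sequence (global context).
import torch
import torch.nn as nn

class MaskedStepModel(nn.Module):
    def __init__(self, dim=512, num_step_labels=1000):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_step_labels)

    def forward(self, step_feats, mask_idx):
        """step_feats: (B, T, dim); mask_idx: the position to mask out."""
        x = step_feats.clone()
        x[:, mask_idx] = self.mask_token    # hide the target step
        ctx = self.encoder(x)               # bidirectional: sees the whole video
        return self.head(ctx[:, mask_idx])  # logits for the masked step

model = MaskedStepModel()
logits = model(torch.randn(2, 8, 512), mask_idx=3)  # (2, num_step_labels)
```

A representation trained this way can also score how probable an observed step is given the rest of the sequence, which is the kind of signal the mistake-detection benchmarks above probe.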
arXiv Detail & Related papers (2023-03-23T17:59:54Z) - My View is the Best View: Procedure Learning from Egocentric Videos [31.385646424154732]
Existing approaches commonly use third-person videos for learning the procedure.
We observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action.
We present a novel self-supervised Correspond and Cut framework for procedure learning.
arXiv Detail & Related papers (2022-07-22T05:28:11Z) - P3IV: Probabilistic Procedure Planning from Instructional Videos with
Weak Supervision [31.73732506824829]
We study the problem of procedure planning in instructional videos.
Here, an agent must produce a plausible sequence of actions that can transform the environment from a given start to a desired goal state.
We propose a weakly supervised approach by learning from natural language instructions.
arXiv Detail & Related papers (2022-05-04T19:37:32Z) - Learning To Recognize Procedural Activities with Distant Supervision [96.58436002052466]
We consider the problem of classifying fine-grained, multi-step activities from long videos spanning up to several minutes.
Our method uses a language model to match noisy, automatically transcribed speech from the video to step descriptions in the knowledge base (a toy version of this matching is sketched below).
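A toy version of that distant-supervision matching, using simple lexical overlap as a stand-in for the paper's language-model similarity:

```python
# Hypothetical sketch: label noisy ASR segments with knowledge-base steps.
def score(a, b):
    """Toy Jaccard word overlap; a real system uses a language model."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

asr = ["now we whisk the eggs", "pour it into the hot pan"]
kb = ["whisk the eggs", "heat the pan", "pour the batter into the pan"]

labels = [max(range(len(kb)), key=lambda j: score(seg, kb[j])) for seg in asr]
print(labels)  # [0, 2]: each ASR segment gets a distant step label
```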
arXiv Detail & Related papers (2022-01-26T15:06:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.