P3IV: Probabilistic Procedure Planning from Instructional Videos with
Weak Supervision
- URL: http://arxiv.org/abs/2205.02300v1
- Date: Wed, 4 May 2022 19:37:32 GMT
- Title: P3IV: Probabilistic Procedure Planning from Instructional Videos with
Weak Supervision
- Authors: He Zhao and Isma Hadji and Nikita Dvornik and Konstantinos G. Derpanis
and Richard P. Wildes and Allan D. Jepson
- Abstract summary: We study the problem of procedure planning in instructional videos.
Here, an agent must produce a plausible sequence of actions that can transform the environment from a given start to a desired goal state.
We propose a weakly supervised approach by learning from natural language instructions.
- Score: 31.73732506824829
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study the problem of procedure planning in instructional
videos. Here, an agent must produce a plausible sequence of actions that can
transform the environment from a given start to a desired goal state. When
learning procedure planning from instructional videos, most recent work
leverages intermediate visual observations as supervision, which requires
expensive annotation efforts to localize precisely all the instructional steps
in training videos. In contrast, we remove the need for expensive temporal
video annotations and propose a weakly supervised approach by learning from
natural language instructions. Our model is based on a transformer equipped
with a memory module, which maps the start and goal observations to a sequence
of plausible actions. Furthermore, we augment our model with a probabilistic
generative module to capture the uncertainty inherent to procedure planning, an
aspect largely overlooked by previous work. We evaluate our model on three
datasets and show that our weakly supervised approach outperforms previous fully
supervised state-of-the-art models on multiple metrics.
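Based only on the description in the abstract, below is a minimal, illustrative PyTorch sketch of such a model: a transformer decoder equipped with a learnable memory that maps start and goal observations to a fixed-length sequence of action logits, with a noise-driven latent injected into the step queries so that resampling yields multiple plausible plans. Every name, dimension, and the particular latent mechanism here is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ProbabilisticPlanner(nn.Module):
    """Hypothetical sketch of a transformer planner with a learnable memory and a
    latent-variable head, loosely following the abstract's description. All module
    names, sizes, and the sampling scheme are assumptions."""

    def __init__(self, num_actions, d_model=256, horizon=4,
                 memory_slots=64, nhead=4, num_layers=2, latent_dim=32):
        super().__init__()
        self.horizon = horizon
        self.latent_dim = latent_dim
        # Learnable "memory" the decoder attends to (stand-in for the paper's memory module).
        self.memory = nn.Parameter(torch.randn(memory_slots, d_model))
        # Project start/goal visual features (assumed 512-d precomputed clip embeddings).
        self.obs_proj = nn.Linear(512, d_model)
        # Learned queries, one per step of the plan.
        self.step_queries = nn.Parameter(torch.randn(horizon, d_model))
        # Map a noise vector to a latent that perturbs the queries (captures plan uncertainty).
        self.latent_proj = nn.Linear(latent_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.action_head = nn.Linear(d_model, num_actions)

    def forward(self, start_feat, goal_feat, z=None):
        # start_feat, goal_feat: (batch, 512) visual features of the start/goal observations.
        b = start_feat.size(0)
        if z is None:
            z = torch.randn(b, self.latent_dim, device=start_feat.device)
        obs = self.obs_proj(torch.stack([start_feat, goal_feat], dim=1))   # (b, 2, d)
        queries = self.step_queries.unsqueeze(0).expand(b, -1, -1)
        queries = queries + self.latent_proj(z).unsqueeze(1)               # inject stochasticity
        # Decoder attends to [start, goal] observations plus the learnable memory.
        memory = torch.cat([obs, self.memory.unsqueeze(0).expand(b, -1, -1)], dim=1)
        hidden = self.decoder(queries, memory)
        return self.action_head(hidden)  # (b, horizon, num_actions) logits per plan step


if __name__ == "__main__":
    planner = ProbabilisticPlanner(num_actions=100)
    start, goal = torch.randn(1, 512), torch.randn(1, 512)
    plans = [planner(start, goal).argmax(-1) for _ in range(5)]  # 5 stochastic plans
    print(torch.stack(plans).squeeze(1))  # each row: a sequence of predicted action ids
```

Resampling the latent z for the same start/goal pair is what makes the planner probabilistic; the paper's actual generative module and training objective are not specified in this summary, so the sketch only illustrates the input/output interface.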
Related papers
- VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning [59.68917139718813]
We show that a strong off-the-shelf frozen pretrained visual encoder can achieve state-of-the-art (SoTA) performance in forecasting and procedural planning.
By conditioning on frozen clip-level embeddings from observed steps to predict the actions of unseen steps, our prediction model is able to learn robust representations for forecasting.
arXiv Detail & Related papers (2024-10-04T14:52:09Z)
- Test-Time Zero-Shot Temporal Action Localization [58.84919541314969]
ZS-TAL seeks to identify and locate actions in untrimmed videos unseen during training.
Training-based ZS-TAL approaches assume the availability of labeled data for supervised learning.
We introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL).
arXiv Detail & Related papers (2024-04-08T11:54:49Z)
- Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos [16.333295670635557]
We explore the capability of an agent to construct a logical sequence of action steps, thereby assembling a strategic procedural plan.
This plan is crucial for navigating from an initial visual observation to a target visual outcome, as depicted in real-life instructional videos.
We coin our approach KEPP, a novel Knowledge-Enhanced Procedure Planning system, which harnesses a probabilistic procedural knowledge graph extracted from training data (a minimal sketch of such a graph follows this list).
arXiv Detail & Related papers (2024-03-05T08:55:51Z)
- CI w/o TN: Context Injection without Task Name for Procedure Planning [4.004155037293416]
Procedure planning in instructional videos involves creating goal-directed plans based on visual start and goal observations from videos.
Previous research has tackled this problem with gradually weaker training supervision, from heavy intermediate visual observations or language instructions to task class supervision.
We propose a much weaker setting, with no task name provided as supervision, which existing large language models cannot currently solve.
arXiv Detail & Related papers (2024-02-23T19:34:47Z)
- A Control-Centric Benchmark for Video Prediction [69.22614362800692]
We propose a benchmark for action-conditioned video prediction in the form of a control benchmark.
Our benchmark includes simulated environments with 11 task categories and 310 task instance definitions.
We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling.
arXiv Detail & Related papers (2023-04-26T17:59:45Z)
- Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations [22.723309913388196]
We learn video representation that encodes both action steps and their temporal ordering, based on a large-scale dataset of web instructional videos and their narrations.
Our method jointly learns a video representation to encode individual step concepts, and a deep probabilistic model to capture both temporal dependencies and immense individual variations in the step ordering.
arXiv Detail & Related papers (2023-03-31T07:02:26Z)
- PDPP: Projected Diffusion for Procedure Planning in Instructional Videos [30.637651835289635]
We study the problem of procedure planning in instructional videos.
This problem aims to make goal-directed plans given the current visual observations in unstructured real-life videos.
arXiv Detail & Related papers (2023-03-26T10:50:16Z)
- Learning and Verification of Task Structure in Instructional Videos [85.511888642497]
We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos.
Compared to prior work which learns step representations locally, our approach involves learning them globally.
We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order.
arXiv Detail & Related papers (2023-03-23T17:59:54Z)
- Procedure Planning in Instructional Videos via Contextual Modeling and Model-based Policy Learning [114.1830997893756]
This work focuses on learning a model to plan goal-directed actions in real-life videos.
We propose novel algorithms to model human behaviors through Bayesian Inference and model-based Imitation Learning.
arXiv Detail & Related papers (2021-10-05T01:06:53Z)
- Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
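The KEPP entry above mentions a probabilistic procedural knowledge graph extracted from training data. As a purely illustrative sketch, one minimal reading of such a structure is a first-order transition model over action steps, estimated by counting consecutive steps in training plan sequences; the class name, smoothing, and re-ranking usage below are assumptions, not the paper's actual construction.

```python
import math
from collections import defaultdict


class ProceduralKnowledgeGraph:
    """Hypothetical first-order transition model over action steps,
    estimated by counting consecutive steps in training plan sequences."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        # sequences: iterable of action-step id lists, e.g. [[3, 7, 12], [3, 9, 12]]
        for seq in sequences:
            for prev, nxt in zip(seq, seq[1:]):
                self.counts[prev][nxt] += 1
        return self

    def transition_prob(self, prev, nxt, eps=1e-6):
        successors = self.counts[prev]
        total = sum(successors.values())
        if total == 0:
            return eps  # predecessor step never seen in training
        return max(successors[nxt] / total, eps)

    def plan_score(self, plan):
        # Log-probability of a candidate plan under the transition model.
        return sum(math.log(self.transition_prob(a, b)) for a, b in zip(plan, plan[1:]))


# Example: the step ordering seen more often in training scores higher.
graph = ProceduralKnowledgeGraph().fit([[3, 7, 12], [3, 7, 12], [3, 9, 12]])
print(graph.plan_score([3, 7, 12]) > graph.plan_score([3, 12, 7]))  # True
```

A score of this kind could, for instance, be used to re-rank candidate plans sampled from a probabilistic planner so that step orderings observed in training are preferred.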
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.