Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
- URL: http://arxiv.org/abs/2307.08763v2
- Date: Sun, 29 Oct 2023 04:16:11 GMT
- Title: Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
- Authors: Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras,
Kristen Grauman
- Abstract summary: Procedural activity understanding requires perceiving human actions in terms of a broader task.
We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps.
We show the impact: more reliable zero-shot keystep localization and improved video representation learning.
- Score: 71.16703750980143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Procedural activity understanding requires perceiving human actions in terms
of a broader task, where multiple keysteps are performed in sequence across a
long video to reach a final goal state -- such as the steps of a recipe or a
DIY fix-it task. Prior work largely treats keystep recognition in isolation of
this broader structure, or else rigidly confines keysteps to align with a
predefined sequential script. We propose discovering a task graph automatically
from how-to videos to represent probabilistically how people tend to execute
keysteps, and then leverage this graph to regularize keystep recognition in
novel videos. On multiple datasets of real-world instructional videos, we show
the impact: more reliable zero-shot keystep localization and improved video
representation learning, exceeding the state of the art.
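To make the core idea concrete, below is a minimal sketch (in Python, not the authors' released code) of one way such a graph could be mined and used: keystep transition probabilities are estimated from keystep sequences observed in how-to videos, then combined with per-frame keystep scores via Viterbi decoding so that predictions respect likely keystep orderings. The keystep IDs, Laplace smoothing, decoding weight, and toy data are all illustrative assumptions.
```python
"""Minimal sketch (not the authors' implementation): mine a probabilistic task
graph from keystep sequences and use it to regularize per-frame keystep scores."""
import numpy as np

def mine_task_graph(sequences, num_keysteps, smoothing=1.0):
    """Estimate keystep-to-keystep transition probabilities from mined sequences."""
    counts = np.full((num_keysteps, num_keysteps), smoothing)  # Laplace smoothing
    for seq in sequences:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev, nxt] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)  # each row sums to 1

def regularized_keystep_decoding(frame_scores, transition, weight=1.0):
    """Viterbi decoding that combines per-frame keystep likelihoods with the graph.

    frame_scores: (T, K) per-frame keystep probabilities (e.g. from a video-text
    similarity model); transition: (K, K) task-graph transition probabilities.
    """
    log_scores = np.log(frame_scores + 1e-9)
    log_trans = weight * np.log(transition + 1e-9)
    T, K = log_scores.shape
    dp = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    dp[0] = log_scores[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + log_trans   # (prev keystep, next keystep)
        back[t] = cand.argmax(axis=0)
        dp[t] = cand.max(axis=0) + log_scores[t]
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]  # one keystep label per frame

# Toy usage: 3 keysteps, sequences mined from how-to videos, noisy frame scores.
graph = mine_task_graph([[0, 1, 2], [0, 2], [0, 1, 2]], num_keysteps=3)
scores = np.random.dirichlet(np.ones(3), size=10)  # 10 frames, 3 keysteps
print(regularized_keystep_decoding(scores, graph))
```
In practice the per-frame scores would come from a video-language model scoring each frame against keystep names; the mined graph then acts as a prior that discourages implausible keystep jumps during localization.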
Related papers
- Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos [13.99137623722021]
Procedural activities are sequences of key-steps aimed at achieving specific goals.
Task graphs have emerged as a human-understandable representation of procedural activities.
arXiv Detail & Related papers (2024-06-03T16:11:39Z)
- Non-Sequential Graph Script Induction via Multimedia Grounding [129.83134296316493]
We train a script knowledge model capable of both generating explicit graph scripts for learnt tasks and predicting future steps given a partial step sequence.
Human evaluation shows our model outperforming the WikiHow linear baseline, with 48.76% absolute gains in capturing sequential and non-sequential step relationships.
arXiv Detail & Related papers (2023-05-27T18:13:17Z)
- StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos [47.03252542488226]
We introduce StepFormer, a self-supervised model that discovers and localizes instruction steps in a video.
We train our system on a large dataset of instructional videos, using their automatically-generated subtitles as the only source of supervision.
Our model outperforms all previous unsupervised and weakly-supervised approaches on step detection and localization.
arXiv Detail & Related papers (2023-04-26T03:37:28Z)
- Procedure-Aware Pretraining for Instructional Video Understanding [58.214549181779006]
A key challenge in procedure understanding is extracting procedural knowledge from unlabeled videos.
Our main insight is that instructional videos depict sequences of steps that repeat between instances of the same or different tasks.
A procedural knowledge graph built from these shared steps can then be used to generate pseudo labels to train a video representation that encodes the procedural knowledge in a more accessible form.
arXiv Detail & Related papers (2023-03-31T17:41:31Z)
- Learning and Verification of Task Structure in Instructional Videos [85.511888642497]
We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos.
Compared to prior work which learns step representations locally, our approach involves learning them globally.
We introduce two new benchmarks for detecting mistakes in instructional videos, verifying whether a step is anomalous and whether steps are executed in the right order.
arXiv Detail & Related papers (2023-03-23T17:59:54Z)
- Learning To Recognize Procedural Activities with Distant Supervision [96.58436002052466]
We consider the problem of classifying fine-grained, multi-step activities from long videos spanning up to several minutes.
Our method uses a language model to match noisy, automatically transcribed speech from the video to step descriptions in the knowledge base (a toy sketch of this matching idea appears after this list).
arXiv Detail & Related papers (2022-01-26T15:06:28Z)
- Deep Multimodal Feature Encoding for Video Ordering [34.27175264084648]
We present a way to learn a compact multimodal feature representation that encodes all these modalities.
Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline.
We analyze and evaluate the individual and joint modalities on challenging tasks, including (i) inferring the temporal ordering of a set of videos and (ii) action recognition.
arXiv Detail & Related papers (2020-04-05T14:02:23Z)
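As referenced in the Distant Supervision entry above, here is a toy sketch of the speech-to-step matching idea: noisy ASR segments are assigned the most similar step description as a distant (pseudo) label. That paper matches with a pretrained language model; the TF-IDF similarity, step texts, and transcripts below are purely illustrative stand-ins.
```python
"""Toy sketch of matching ASR transcripts to knowledge-base step descriptions.
TF-IDF similarity is only a stand-in for the language-model matching used in
the Distant Supervision paper; all texts below are made up."""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical knowledge-base step descriptions (e.g. from wikiHow).
steps = [
    "crack the eggs into a bowl",
    "whisk the eggs with milk and salt",
    "pour the mixture into a hot pan",
]
# Noisy, automatically transcribed speech segments from the video.
asr_segments = [
    "okay so now whisk those eggs with a bit of milk",
    "pour everything into the pan once it's hot",
]

vectorizer = TfidfVectorizer().fit(steps + asr_segments)
sims = cosine_similarity(vectorizer.transform(asr_segments),
                         vectorizer.transform(steps))
# Assign each speech segment the most similar step as a distant (pseudo) label.
for segment, scores in zip(asr_segments, sims):
    print(segment, "->", steps[scores.argmax()])
```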