Graph2Vid: Flow graph to Video Grounding for Weakly-supervised Multi-Step
Localization
- URL: http://arxiv.org/abs/2210.04996v1
- Date: Mon, 10 Oct 2022 20:02:58 GMT
- Title: Graph2Vid: Flow graph to Video Grounding for Weakly-supervised Multi-Step
Localization
- Authors: Nikita Dvornik, Isma Hadji, Hai Pham, Dhaivat Bhatt, Brais Martinez,
Afsaneh Fazly, Allan D. Jepson
- Abstract summary: We consider the problem of weakly-supervised multi-step localization in instructional videos.
An established approach to this problem is to rely on a given list of steps.
We propose a new algorithm - Graph2Vid - that infers the actual ordering of steps in the video and simultaneously localizes them.
- Score: 14.95378874133603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we consider the problem of weakly-supervised multi-step
localization in instructional videos. An established approach to this problem
is to rely on a given list of steps. However, in reality, there is often more
than one way to execute a procedure successfully, by following the set of steps
in slightly varying orders. Thus, for successful localization in a given video,
recent works require the actual order of procedure steps in the video to be
provided by human annotators at both training and test times. Instead, we rely
only on generic procedural text that is not tied to a specific video. We
represent the various ways to complete the procedure by transforming the list
of instructions into a procedure flow graph which captures the partial order of
steps. Using the flow graphs reduces both training and test time annotation
requirements. To this end, we introduce the new problem of flow graph to video
grounding. In this setup, we seek the optimal step ordering consistent with the
procedure flow graph and a given video. To solve this problem, we propose a new
algorithm - Graph2Vid - that infers the actual ordering of steps in the video
and simultaneously localizes them. To show the advantage of our proposed
formulation, we extend the CrossTask dataset with procedure flow graph
information. Our experiments show that Graph2Vid is both more efficient than
the baselines and yields strong step localization results, without the need for
step order annotation.
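
To make the grounding setup concrete, below is a rough, hypothetical Python sketch of the naive baseline the paper improves upon: enumerate every step ordering consistent with a small flow graph and score each ordering against the video with a simple monotonic-alignment dynamic program. This is not the Graph2Vid algorithm itself (which grounds the graph directly, without enumeration); the toy graph, the step indices, and the step-to-frame similarity matrix are all illustrative assumptions.

from itertools import permutations
import numpy as np

def topological_orderings(nodes, edges):
    # Enumerate all orderings of `nodes` consistent with the partial
    # order in `edges` (each pair (u, v) means step u must precede v).
    for perm in permutations(nodes):
        pos = {n: i for i, n in enumerate(perm)}
        if all(pos[u] < pos[v] for u, v in edges):
            yield perm

def align_score(order, sim):
    # DTW-style DP: best monotonic assignment of consecutive video
    # frames to the steps in `order`; sim[s, t] is the (hypothetical)
    # similarity between step s and frame t.
    num_steps, num_frames = len(order), sim.shape[1]
    dp = np.full((num_steps + 1, num_frames + 1), -np.inf)
    dp[0, :] = 0.0  # background frames allowed before the first step
    for i, step in enumerate(order, start=1):
        for t in range(1, num_frames + 1):
            stay = dp[i, t - 1]         # frame t continues step i
            advance = dp[i - 1, t - 1]  # frame t starts step i
            dp[i, t] = max(stay, advance) + sim[step, t - 1]
    return dp[num_steps, 1:].max()  # background allowed after the last step

# Toy flow graph: steps 1 and 2 may be performed in either order.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
sim = np.random.rand(len(nodes), 20)  # stand-in step-to-frame similarities

best = max(topological_orderings(nodes, edges), key=lambda o: align_score(o, sim))
print("best ordering consistent with the flow graph:", best)

The number of orderings consistent with a flow graph can grow combinatorially with graph size, which is why the paper reports Graph2Vid as more efficient than such enumeration-based baselines.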
Related papers
- Box2Flow: Instance-based Action Flow Graphs from Videos [16.07460333800912]
Flow graphs can be used to illustrate the step relationships of a task.
Current task-based methods try to learn a single flow graph for all available videos of a specific task.
We propose Box2Flow, an instance-based method to predict a step flow graph from a given procedural video.
arXiv Detail & Related papers (2024-08-30T23:33:19Z)
- Non-Sequential Graph Script Induction via Multimedia Grounding [129.83134296316493]
We train a script knowledge model capable of both generating explicit graph scripts for learnt tasks and predicting future steps given a partial step sequence.
Human evaluation shows our model outperforms the WikiHow linear baseline by 48.76% absolute in capturing sequential and non-sequential step relationships.
arXiv Detail & Related papers (2023-05-27T18:13:17Z)
- StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos [47.03252542488226]
We introduce StepFormer, a self-supervised model that discovers and localizes instruction steps in a video.
We train our system on a large dataset of instructional videos, using their automatically-generated subtitles as the only source of supervision.
Our model outperforms all previous unsupervised and weakly-supervised approaches on step detection and localization.
arXiv Detail & Related papers (2023-04-26T03:37:28Z)
- Procedure-Aware Pretraining for Instructional Video Understanding [58.214549181779006]
A key challenge in procedure understanding is extracting procedural knowledge from unlabeled videos.
Our main insight is that instructional videos depict sequences of steps that repeat between instances of the same or different tasks, which lets us build a graph of procedure steps.
This graph can then be used to generate pseudo labels to train a video representation that encodes the procedural knowledge in a more accessible form.
arXiv Detail & Related papers (2023-03-31T17:41:31Z)
- Learning and Verification of Task Structure in Instructional Videos [85.511888642497]
We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos.
Compared to prior work, which learns step representations locally, our approach learns them globally.
We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order.
arXiv Detail & Related papers (2023-03-23T17:59:54Z)
- Learning To Recognize Procedural Activities with Distant Supervision [96.58436002052466]
We consider the problem of classifying fine-grained, multi-step activities from long videos spanning up to several minutes.
Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base.
arXiv Detail & Related papers (2022-01-26T15:06:28Z)
- SVIP: Sequence VerIfication for Procedures in Videos [68.07865790764237]
We propose a novel sequence verification task that aims to distinguish positive video pairs performing the same action sequence from negative ones with step-level transformations.
This task is challenging because it operates in an open-set setting, without prior action detection or segmentation.
We collect a scripted video dataset enumerating all kinds of step-level transformations in chemical experiments.
arXiv Detail & Related papers (2021-12-13T07:03:36Z)