Non-Sequential Graph Script Induction via Multimedia Grounding
- URL: http://arxiv.org/abs/2305.17542v1
- Date: Sat, 27 May 2023 18:13:17 GMT
- Title: Non-Sequential Graph Script Induction via Multimedia Grounding
- Authors: Yu Zhou, Sha Li, Manling Li, Xudong Lin, Shih-Fu Chang, Mohit Bansal and Heng Ji
- Abstract summary: We train a script knowledge model capable of both generating explicit graph scripts for learnt tasks and predicting future steps given a partial step sequence.
Human evaluation shows our model outperforming the WikiHow linear baseline by 48.76% absolute gains in capturing sequential and non-sequential step relationships.
- Score: 129.83134296316493
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online resources such as WikiHow compile a wide range of scripts for
performing everyday tasks, which can assist models in learning to reason about
procedures. However, the scripts are always presented in a linear manner, which
does not reflect the flexibility displayed by people executing tasks in real
life. For example, in the CrossTask Dataset, 64.5% of consecutive step pairs
are also observed in the reverse order, suggesting their ordering is not fixed.
In addition, each step has an average of 2.56 frequent next steps,
demonstrating "branching". In this paper, we propose the new challenging task
of non-sequential graph script induction, aiming to capture optional and
interchangeable steps in procedural planning. To automate the induction of such
graph scripts for given tasks, we propose to take advantage of loosely aligned
videos of people performing the tasks. In particular, we design a multimodal
framework to ground procedural videos to WikiHow textual steps and thus
transform each video into an observed step path on the latent ground truth
graph script. This key transformation enables us to train a script knowledge
model capable of both generating explicit graph scripts for learnt tasks and
predicting future steps given a partial step sequence. Our best model
outperforms the strongest pure text/vision baselines by 17.52% absolute gains
on F1@3 for next step prediction and 13.8% absolute gains on Acc@1 for partial
sequence completion. Human evaluation shows our model outperforming the WikiHow
linear baseline by 48.76% absolute gains in capturing sequential and
non-sequential step relationships.
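As an illustration of the "reversible" step pairs and "branching" statistics quoted in the abstract, the minimal Python sketch below is not the authors' actual pipeline: it assumes each video has already been grounded to a sequence of step labels, builds a transition-count graph from those observed paths, and derives a reversible-pair fraction, an average number of frequent next steps, and a simple top-k next-step ranking. The function names, the frequency threshold, the counting convention, and the toy data are all illustrative assumptions.

```python
from collections import Counter, defaultdict

def build_step_graph(paths):
    """Count directed step-to-step transitions observed in grounded video paths.

    Each path is a list of step labels (e.g., WikiHow step indices for one video).
    """
    edge_counts = Counter()
    for path in paths:
        for a, b in zip(path, path[1:]):
            edge_counts[(a, b)] += 1
    return edge_counts

def reversible_pair_fraction(edge_counts):
    """Fraction of distinct observed consecutive pairs whose reverse order also occurs."""
    pairs = list(edge_counts)
    reversible = sum(1 for (a, b) in pairs if (b, a) in edge_counts)
    return reversible / len(pairs) if pairs else 0.0

def avg_frequent_next_steps(edge_counts, min_count=2):
    """Average number of distinct next steps per step, keeping only frequent edges."""
    nexts = defaultdict(set)
    for (a, b), count in edge_counts.items():
        if count >= min_count:
            nexts[a].add(b)
    return sum(len(s) for s in nexts.values()) / len(nexts) if nexts else 0.0

def predict_next_steps(edge_counts, current_step, k=3):
    """Rank candidate next steps for a partial sequence ending in `current_step`."""
    candidates = Counter({b: c for (a, b), c in edge_counts.items() if a == current_step})
    return [step for step, _ in candidates.most_common(k)]

# Toy usage: three videos of the same task, grounded to step labels A-D.
paths = [["A", "B", "C", "D"], ["A", "C", "B", "D"], ["A", "B", "D"]]
graph = build_step_graph(paths)
print(reversible_pair_fraction(graph))       # 2 of 6 ordered pairs (B->C, C->B) have a reverse: ~0.33
print(avg_frequent_next_steps(graph, 1))     # each observed step has 2 distinct next steps: 2.0
print(predict_next_steps(graph, "B", k=3))   # ['D', 'C'] (D follows B twice, C once)
```

A plain count graph is enough to make the abstract's statistics concrete; the paper itself trains a script knowledge model on top of the grounded step paths rather than reading predictions directly off transition counts.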
Related papers
- Box2Flow: Instance-based Action Flow Graphs from Videos [16.07460333800912]
Flow graphs can be used to illustrate the step relationships of a task.
Current task-based methods try to learn a single flow graph for all available videos of a specific task.
We propose Box2Flow, an instance-based method to predict a step flow graph from a given procedural video.
arXiv Detail & Related papers (2024-08-30T23:33:19Z)
- Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos [13.99137623722021]
Procedural activities are sequences of key-steps aimed at achieving specific goals.
Task graphs have emerged as a human-understandable representation of procedural activities.
arXiv Detail & Related papers (2024-06-03T16:11:39Z)
- MULTISCRIPT: Multimodal Script Learning for Supporting Open Domain Everyday Tasks [28.27986773292919]
We present a new benchmark challenge -- MultiScript -- comprising two tasks: multimodal script generation and subsequent step prediction.
For both tasks, the input consists of a target task name and a video illustrating what has been done to complete the target task.
The expected output is (1) a sequence of structured step descriptions in text based on the demonstration video, and (2) a single text description for the subsequent step.
arXiv Detail & Related papers (2023-10-08T01:51:17Z)
- Video-Mined Task Graphs for Keystep Recognition in Instructional Videos [71.16703750980143]
Procedural activity understanding requires perceiving human actions in terms of a broader task.
We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps.
We show the impact: more reliable zero-shot keystep localization and improved video representation learning.
arXiv Detail & Related papers (2023-07-17T18:19:36Z)
- Learning to Ground Instructional Articles in Videos through Narrations [50.3463147014498]
We present an approach for localizing steps of procedural activities in narrated how-to videos.
We source the step descriptions from a language knowledge base (wikiHow) containing instructional articles.
Our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities.
arXiv Detail & Related papers (2023-06-06T15:45:53Z)
- Procedure-Aware Pretraining for Instructional Video Understanding [58.214549181779006]
A key challenge in procedure understanding is extracting procedural knowledge from unlabeled videos.
Our main insight is that instructional videos depict sequences of steps that repeat between instances of the same or different tasks, and that these shared steps can be organized into a procedural knowledge graph.
This graph can then be used to generate pseudo labels to train a video representation that encodes the procedural knowledge in a more accessible form.
arXiv Detail & Related papers (2023-03-31T17:41:31Z)
- Learning and Verification of Task Structure in Instructional Videos [85.511888642497]
We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos.
Compared to prior work which learns step representations locally, our approach involves learning them globally.
We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order.
arXiv Detail & Related papers (2023-03-23T17:59:54Z)
- Graph2Vid: Flow graph to Video Grounding for Weakly-supervised Multi-Step Localization [14.95378874133603]
We consider the problem of weakly-supervised multi-step localization in instructional videos.
An established approach to this problem is to rely on a given list of steps.
We propose a new algorithm - Graph2Vid - that infers the actual ordering of steps in the video and simultaneously localizes them.
arXiv Detail & Related papers (2022-10-10T20:02:58Z)
- SVIP: Sequence VerIfication for Procedures in Videos [68.07865790764237]
We propose a novel sequence verification task that aims to distinguish positive video pairs performing the same action sequence from negative ones with step-level transformations.
Such a challenging task resides in an open-set setting without prior action detection or segmentation.
We collect a scripted video dataset enumerating all kinds of step-level transformations in chemical experiments.
arXiv Detail & Related papers (2021-12-13T07:03:36Z)
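To give a concrete sense of the "step-level transformations" mentioned in the SVIP entry above, the toy sketch below builds negative variants of a step sequence by misordering or dropping steps. This is an illustrative simplification under stated assumptions: the actual task operates on video pairs and learned representations rather than symbolic step lists, and the specific transformations shown (adjacent swap, single-step deletion) are assumptions, not the dataset's exact definition.

```python
import random

def step_level_negatives(steps, rng=random.Random(0)):
    """Create negative variants of a step sequence via simple step-level transformations.

    `steps` is a list of step labels describing one scripted procedure.
    Returns a (misordered, missing_step) pair of variants.
    """
    assert len(steps) >= 3, "need at least three steps to transform meaningfully"

    # Misordering: swap one pair of adjacent steps.
    i = rng.randrange(len(steps) - 1)
    misordered = steps[:]
    misordered[i], misordered[i + 1] = misordered[i + 1], misordered[i]

    # Missing step: drop one step entirely.
    j = rng.randrange(len(steps))
    missing = steps[:j] + steps[j + 1:]

    return misordered, missing

# Toy usage with a chemistry-flavored step sequence.
positive = ["measure reagent", "add to flask", "stir", "heat", "record result"]
neg_swap, neg_drop = step_level_negatives(positive)
print(neg_swap)  # same steps, two adjacent ones out of order
print(neg_drop)  # one step missing
```

The printed outputs show a misordered and a missing-step negative, respectively; a verification model would be expected to separate either of them from the unmodified positive sequence.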