Box2Flow: Instance-based Action Flow Graphs from Videos
- URL: http://arxiv.org/abs/2409.00295v1
- Date: Fri, 30 Aug 2024 23:33:19 GMT
- Title: Box2Flow: Instance-based Action Flow Graphs from Videos
- Authors: Jiatong Li, Kalliopi Basioti, Vladimir Pavlovic
- Abstract summary: Flow graphs can be used to illustrate the step relationships of a task.
Current task-based methods try to learn a single flow graph for all available videos of a specific task.
We propose Box2Flow, an instance-based method to predict a step flow graph from a given procedural video.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A large number of procedural videos on the web show how to complete various tasks. These tasks can often be accomplished in different ways and step orderings, with some steps able to be performed simultaneously, while others are constrained to be completed in a specific order. Flow graphs can be used to illustrate the step relationships of a task. Current task-based methods try to learn a single flow graph for all available videos of a specific task. The extracted flow graphs tend to be too abstract, failing to capture detailed step descriptions. In this work, our aim is to learn accurate and rich flow graphs by extracting them from a single video. We propose Box2Flow, an instance-based method to predict a step flow graph from a given procedural video. In detail, we extract bounding boxes from videos, predict pairwise edge probabilities between step pairs, and build the flow graph with a spanning tree algorithm. Experiments on MM-ReS and YouCookII show our method can extract flow graphs effectively.
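A minimal Python sketch (not the authors' released code) of the final stage the abstract names: given a matrix of predicted pairwise edge probabilities between steps, a maximum spanning tree keeps the most confident connections while guaranteeing a connected, cycle-free flow graph. The probability matrix below is illustrative, and networkx is assumed; Box2Flow's actual tree construction may differ.

```python
import networkx as nx
import numpy as np

def flow_graph_from_edge_probs(probs: np.ndarray) -> nx.Graph:
    """probs[i, j]: predicted probability that steps i and j are connected."""
    g = nx.Graph()
    n = probs.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            g.add_edge(i, j, weight=probs[i, j])
    # The maximum spanning tree retains the highest-probability edges
    # while keeping the step graph connected and acyclic.
    return nx.maximum_spanning_tree(g)

probs = np.array([[0.0, 0.9, 0.2],
                  [0.9, 0.0, 0.7],
                  [0.2, 0.7, 0.0]])  # toy values for three steps
print(sorted(flow_graph_from_edge_probs(probs).edges(data="weight")))
```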
Related papers
- InstructG2I: Synthesizing Images from Multimodal Attributed Graphs [50.852150521561676]
We propose a graph context-conditioned diffusion model called InstructG2I.
InstructG2I first exploits the graph structure and multimodal information to conduct informative neighbor sampling.
A Graph-QFormer encoder adaptively encodes the graph nodes into an auxiliary set of graph prompts to guide the denoising process.
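As a rough illustration of the neighbor-sampling idea (an assumption about the mechanism, not the paper's procedure), one could bias sampling toward neighbors whose features are most relevant to the target node:

```python
import numpy as np

def sample_informative_neighbors(target, neighbors, k, rng=None):
    """Sample k neighbor indices, biased toward features similar to the target."""
    rng = rng or np.random.default_rng(0)
    scores = neighbors @ target            # relevance of each candidate neighbor
    p = np.exp(scores - scores.max())      # softmax weights
    p /= p.sum()
    return rng.choice(len(neighbors), size=min(k, len(neighbors)),
                      replace=False, p=p)
```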
arXiv Detail & Related papers (2024-10-09T17:56:15Z)
- VideoSAGE: Video Summarization with Graph Representation Learning [9.21019970479227]
We propose a graph-based representation learning framework for video summarization.
A graph constructed this way aims to capture long-range interactions among video frames, and the sparsity ensures the model trains without hitting the memory and compute bottleneck.
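A hedged sketch of such a sparse frame graph (illustrative, not the paper's construction): each frame keeps edges only to its k most similar frames, so distant frames can still be linked without a dense adjacency.

```python
import numpy as np

def sparse_frame_graph(features: np.ndarray, k: int = 2):
    """features: (num_frames, dim). Returns (i, j, similarity) edges."""
    sims = features @ features.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-loops
    edges = []
    for i in range(len(features)):
        for j in np.argsort(sims[i])[-k:]:  # k most similar frames
            edges.append((i, int(j), float(sims[i, j])))
    return edges
```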
arXiv Detail & Related papers (2024-04-14T15:49:02Z)
- All in One: Multi-task Prompting for Graph Neural Networks [30.457491401821652]
We propose a novel multi-task prompting method for graph models.
We first unify the format of graph prompts and language prompts with the prompt token, token structure, and inserting pattern.
We then study the task space of various graph applications and reformulate downstream problems to the graph-level task.
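A rough PyTorch sketch of a learnable graph prompt token (one plausible reading of the idea above; the additive inserting pattern here is a simple illustrative choice, not necessarily the paper's):

```python
import torch
import torch.nn as nn

class GraphPrompt(nn.Module):
    """Learnable prompt tokens mixed into node features, analogous to text prompts."""
    def __init__(self, feat_dim: int, num_tokens: int = 4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, feat_dim))

    def forward(self, node_feats: torch.Tensor) -> torch.Tensor:
        # Inserting pattern: shift each node feature by a weighted
        # combination of the prompt tokens.
        weights = torch.softmax(node_feats @ self.tokens.T, dim=-1)
        return node_feats + weights @ self.tokens
```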
arXiv Detail & Related papers (2023-07-04T06:27:31Z)
- Non-Sequential Graph Script Induction via Multimedia Grounding [129.83134296316493]
We train a script knowledge model capable of both generating explicit graph scripts for learnt tasks and predicting future steps given a partial step sequence.
Human evaluation shows our model outperforms the WikiHow linear baseline by 48.76% absolute gain in capturing sequential and non-sequential step relationships.
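A minimal sketch of next-step prediction over a graph script (illustrative, not the paper's model): candidate next steps are nodes whose graph prerequisites are already completed. networkx is assumed.

```python
import networkx as nx

def next_steps(script: nx.DiGraph, done: set) -> set:
    """Steps whose prerequisite steps in the graph are all satisfied."""
    return {n for n in script.nodes
            if n not in done and all(p in done for p in script.predecessors(n))}

g = nx.DiGraph([("boil water", "add pasta"), ("add pasta", "drain")])
print(next_steps(g, done={"boil water"}))  # {'add pasta'}
```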
arXiv Detail & Related papers (2023-05-27T18:13:17Z)
- Procedure-Aware Pretraining for Instructional Video Understanding [58.214549181779006]
A key challenge in procedure understanding is extracting procedural knowledge from unlabeled videos.
Our main insight is that instructional videos depict sequences of steps that repeat between instances of the same or different tasks.
The resulting task graph can then be used to generate pseudo labels to train a video representation that encodes the procedural knowledge in a more accessible form.
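A toy sketch of that insight (not the paper's graph-building procedure): count step transitions across videos and keep those that repeat, yielding a graph from which pseudo labels could be derived.

```python
from collections import Counter

def step_transition_graph(videos, min_count=2):
    """videos: list of step-name sequences. Returns repeated (a, b) transitions."""
    counts = Counter()
    for steps in videos:
        counts.update(zip(steps, steps[1:]))
    return {edge for edge, c in counts.items() if c >= min_count}

videos = [["crack egg", "whisk", "fry"],
          ["crack egg", "whisk", "season", "fry"]]
print(step_transition_graph(videos))  # {('crack egg', 'whisk')}
```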
arXiv Detail & Related papers (2023-03-31T17:41:31Z)
- Multimodal Subtask Graph Generation from Instructional Videos [51.96856868195961]
Real-world tasks consist of multiple inter-dependent subtasks.
In this work, we aim to model the causal dependencies between such subtasks from instructional videos describing the task.
We present Multimodal Subtask Graph Generation (MSG2), an approach that constructs a Subtask Graph defining the dependencies between the subtasks relevant to a task from noisy web videos.
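An illustrative take on dependency mining in this spirit (not the MSG2 algorithm, which learns from noisy multimodal signals): treat subtask A as a prerequisite of B if A precedes B in every video containing both.

```python
def infer_dependencies(videos):
    """videos: list of subtask sequences. Returns (prerequisite, dependent) pairs."""
    subtasks = {s for v in videos for s in v}
    deps = set()
    for a in subtasks:
        for b in subtasks:
            if a == b:
                continue
            both = [v for v in videos if a in v and b in v]
            if both and all(v.index(a) < v.index(b) for v in both):
                deps.add((a, b))
    return deps
```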
arXiv Detail & Related papers (2023-02-17T03:41:38Z)
- Graph2Vid: Flow graph to Video Grounding for Weakly-supervised Multi-Step Localization [14.95378874133603]
We consider the problem of weakly-supervised multi-step localization in instructional videos.
An established approach to this problem is to rely on a given list of steps.
We propose a new algorithm - Graph2Vid - that infers the actual ordering of steps in the video and simultaneously localizes them.
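A hedged sketch of the core search (a toy stand-in, not the Graph2Vid algorithm): enumerate step orders consistent with the flow graph and keep the one whose per-frame step scores peak in increasing frame order. networkx is assumed.

```python
import networkx as nx
import numpy as np

def best_consistent_order(flow: nx.DiGraph, frame_scores: dict):
    """frame_scores[step]: per-frame score array for that step."""
    best, best_score = None, -np.inf
    for order in nx.all_topological_sorts(flow):
        peaks = [int(np.argmax(frame_scores[s])) for s in order]
        if peaks == sorted(peaks):  # steps must appear in increasing frame order
            score = sum(frame_scores[s].max() for s in order)
            if score > best_score:
                best, best_score = list(order), score
    return best
```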
arXiv Detail & Related papers (2022-10-10T20:02:58Z)
- Learnable Graph Matching: Incorporating Graph Partitioning with Deep Feature Learning for Multiple Object Tracking [58.30147362745852]
Data association across frames is at the core of the Multiple Object Tracking (MOT) task.
Existing methods mostly ignore the context information among tracklets and intra-frame detections.
We propose a novel learnable graph matching method to address these issues.
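For context, a sketch of the classical association step such methods build on (not the paper's learnable graph matching): solve a linear assignment between tracklets and detections over a cost matrix. scipy is assumed, and the costs are toy values.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[t, d]: dissimilarity between tracklet t and detection d (toy values)
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.6]])
rows, cols = linear_sum_assignment(cost)
print(list(zip(rows, cols)))  # minimal-cost tracklet-to-detection matches
```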
arXiv Detail & Related papers (2021-03-30T08:58:45Z)
- Flow-edge Guided Video Completion [66.49077223104533]
Previous flow completion methods are often unable to retain the sharpness of motion boundaries.
Our method first extracts and completes motion edges, and then uses them to guide piecewise-smooth flow completion with sharp edges.
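An illustrative sketch of the first stage (an assumption about the idea, not the authors' method): locate motion edges as large spatial gradients of the flow field, which can then anchor piecewise-smooth completion.

```python
import numpy as np

def motion_edges(flow: np.ndarray, thresh: float = 1.0) -> np.ndarray:
    """flow: (H, W, 2) array of (u, v) motion. Returns a boolean edge mask."""
    uy, ux = np.gradient(flow[..., 0])
    vy, vx = np.gradient(flow[..., 1])
    mag = np.sqrt(ux**2 + uy**2 + vx**2 + vy**2)
    return mag > thresh
```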
arXiv Detail & Related papers (2020-09-03T17:59:42Z)