Multimodal Subtask Graph Generation from Instructional Videos
- URL: http://arxiv.org/abs/2302.08672v1
- Date: Fri, 17 Feb 2023 03:41:38 GMT
- Title: Multimodal Subtask Graph Generation from Instructional Videos
- Authors: Yunseok Jang, Sungryull Sohn, Lajanugen Logeswaran, Tiange Luo,
Moontae Lee, Honglak Lee
- Abstract summary: Real-world tasks consist of multiple inter-dependent subtasks.
In this work, we aim to model the causal dependencies between such subtasks from instructional videos describing the task.
We present Multimodal Subtask Graph Generation (MSG2), an approach that constructs a Subtask Graph defining the dependencies among the subtasks relevant to a task from noisy web videos.
- Score: 51.96856868195961
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-world tasks consist of multiple inter-dependent subtasks (e.g., a dirty
pan needs to be washed before it can be used for cooking). In this work, we aim
to model the causal dependencies between such subtasks from instructional
videos describing the task. This is a challenging problem since complete
information about the world is often inaccessible from videos, which demands
robust learning mechanisms to understand the causal structure of events. We
present Multimodal Subtask Graph Generation (MSG2), an approach that constructs
a Subtask Graph defining the dependencies among the subtasks relevant to a task
from noisy web videos. Graphs generated by our multimodal approach are
closer to human-annotated graphs compared to prior approaches. MSG2 further
performs the downstream task of next subtask prediction 85% and 30% more
accurately than recent video transformer models on the ProceL and CrossTask
datasets, respectively.
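To make the Subtask Graph abstraction concrete, here is a minimal sketch in which a task is a DAG of precondition sets and next subtask prediction reduces to an eligibility check. The subtask names and the rule below are illustrative assumptions, not MSG2's learned model, which infers this structure from multimodal video features.

```python
# Minimal sketch of a Subtask Graph as a DAG of precondition sets.
# Subtask names are illustrative (cf. the dirty-pan example above);
# MSG2 itself learns this structure from noisy web videos.
PRECONDITIONS = {
    "wash_pan": set(),                    # no prerequisites
    "heat_pan": {"wash_pan"},             # a dirty pan must be washed first
    "add_oil":  {"heat_pan"},
    "cook_egg": {"heat_pan", "add_oil"},
}

def eligible_next_subtasks(completed: set) -> set:
    """Return subtasks whose preconditions are all satisfied."""
    return {
        s for s, pre in PRECONDITIONS.items()
        if s not in completed and pre <= completed
    }

print(eligible_next_subtasks({"wash_pan"}))              # {'heat_pan'}
print(eligible_next_subtasks({"wash_pan", "heat_pan"}))  # {'add_oil'}
```

Under this reading, the modeling difficulty the abstract points to is recovering PRECONDITIONS itself from noisy, partially observed videos; once the graph is known, next subtask prediction is straightforward.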
Related papers
- DLM-VMTL: A Double Layer Mapper for heterogeneous data video Multi-task prompt learning [2.4121373594852846]
Multi-Task Learning lets a visual task acquire rich, shareable knowledge from other tasks during joint training.
A heterogeneous-data video multi-task prompt learning (VMTL) method is proposed to address this problem.
A Double-Layers Mapper (DLM) is proposed to extract the shareable knowledge into visual prompts and align it with the representation of the primary task.
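The abstract leaves the mapper's internals unspecified; the following is a rough, hypothetical sketch of the general visual-prompt idea (the module name, layer sizes, and token shapes are all assumptions, not the paper's DLM): shared features are projected into prompt tokens that are prepended to the primary task's token sequence.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of mapping shared multi-task features into visual
# prompt tokens; not the paper's actual DLM architecture.
class PromptMapper(nn.Module):
    def __init__(self, feat_dim=768, n_prompts=8):
        super().__init__()
        # Two stacked linear layers produce the prompt tokens.
        self.mapper = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.GELU(),
            nn.Linear(feat_dim, n_prompts * feat_dim),
        )
        self.n_prompts, self.feat_dim = n_prompts, feat_dim

    def forward(self, shared_feat, primary_tokens):
        # shared_feat: (B, D) pooled features from auxiliary tasks
        # primary_tokens: (B, T, D) token sequence of the primary task
        prompts = self.mapper(shared_feat).view(-1, self.n_prompts, self.feat_dim)
        return torch.cat([prompts, primary_tokens], dim=1)  # (B, P + T, D)

out = PromptMapper()(torch.randn(2, 768), torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 24, 768])
```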
arXiv Detail & Related papers (2024-08-29T01:25:36Z)
- Unsupervised Task Graph Generation from Instructional Video Transcripts [53.54435048879365]
We consider a setting where text transcripts of instructional videos depicting a real-world activity are provided.
The goal is to identify the key steps relevant to the task as well as the dependency relationship between these key steps.
We propose a novel task graph generation approach that combines the reasoning capabilities of instruction-tuned language models with clustering and ranking components.
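A minimal sketch of the clustering and ordering stages on toy transcripts (the string-similarity threshold and greedy grouping are assumptions; the paper's pipeline additionally uses an instruction-tuned language model, omitted here):

```python
from difflib import SequenceMatcher

# Toy transcripts: lists of step mentions from two videos of the same task.
transcripts = [
    ["wash the pan", "heat pan", "crack eggs"],
    ["wash pan", "heat the pan", "crack the eggs"],
]

def similar(a, b, thr=0.8):
    """String-overlap stand-in for an embedding-based similarity."""
    return SequenceMatcher(None, a, b).ratio() >= thr

# Greedily cluster near-duplicate step mentions across transcripts,
# recording each mention's normalized temporal position.
clusters = []
for t in transcripts:
    for pos, step in enumerate(t):
        rel = pos / (len(t) - 1)
        for c in clusters:
            if similar(step, c[0][0]):
                c.append((step, rel))
                break
        else:
            clusters.append([(step, rel)])

# Rank key steps by mean position to propose a dependency ordering.
key_steps = sorted(clusters, key=lambda c: sum(r for _, r in c) / len(c))
print([c[0][0] for c in key_steps])  # ['wash the pan', 'heat pan', 'crack eggs']
```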
arXiv Detail & Related papers (2023-02-17T22:50:08Z)
- MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873]
We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
arXiv Detail & Related papers (2023-02-16T04:00:03Z)
- Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate few-shot task generalization as a reinforcement learning problem where a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experimental results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to unseen tasks.
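As a toy illustration of subtask-graph inference from demonstrations (a deterministic stand-in; MTSGI's inference is probabilistic and operates on agent trajectories), one can add an edge a -> b whenever a precedes b in every demonstration containing both:

```python
from itertools import combinations

# Toy demonstrations of the same task; the order of the last two
# subtasks varies, so no edge should be inferred between them.
trajectories = [
    ["get_key", "open_door", "enter_room"],
    ["get_key", "enter_room", "open_door"],
]

subtasks = sorted({s for t in trajectories for s in t})
edges = set()
for a, b in combinations(subtasks, 2):
    both = [t for t in trajectories if a in t and b in t]
    if all(t.index(a) < t.index(b) for t in both):
        edges.add((a, b))   # a consistently precedes b
    elif all(t.index(b) < t.index(a) for t in both):
        edges.add((b, a))

# {('get_key', 'open_door'), ('get_key', 'enter_room')} (set order may vary)
print(edges)
```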
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
- Graph Representation Learning for Multi-Task Settings: a Meta-Learning Approach [5.629161809575013]
We propose a novel training strategy for graph representation learning, based on meta-learning.
Our method avoids the difficulties arising when learning to perform multiple tasks concurrently.
We show that the embeddings produced by a model trained with our method can be used to perform multiple tasks with comparable or, surprisingly, even higher performance than both single-task and multi-task end-to-end models.
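The abstract does not detail the meta-learning procedure; as a generic, first-order illustration of the idea (a Reptile-style update on a toy linear model, not the paper's method and not a graph model):

```python
import numpy as np

# Generic first-order meta-learning loop (Reptile-style) on a toy linear
# regression; a stand-in for meta-training representations that adapt
# quickly to new tasks, not the paper's actual procedure.
rng = np.random.default_rng(0)
meta_w = np.zeros(5)

def sample_task():
    """Toy regression task: y = x @ w_true + noise."""
    w_true = rng.normal(size=5)
    x = rng.normal(size=(32, 5))
    return x, x @ w_true + 0.1 * rng.normal(size=32)

for _ in range(200):                   # meta-iterations
    x, y = sample_task()
    w = meta_w.copy()
    for _ in range(5):                 # inner-loop adaptation (plain SGD)
        w -= 0.05 * 2 * x.T @ (x @ w - y) / len(y)
    meta_w += 0.1 * (w - meta_w)       # move meta-params toward adapted params
```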
arXiv Detail & Related papers (2022-01-10T12:58:46Z)
- Multi-Relational Graph based Heterogeneous Multi-Task Learning in Community Question Answering [28.91133131424694]
We develop a multi-relational graph based Multi-Task Learning model called the Heterogeneous Multi-Task Graph Isomorphism Network (HMTGIN).
In each training forward pass, HMTGIN embeds the input CQA forum graph with an extension of the Graph Isomorphism Network and skip connections.
In the evaluation, the embeddings are shared among different task-specific output layers to make corresponding predictions.
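A rough sketch of the shared-encoder / task-specific-heads pattern described above (the dense-adjacency GIN update and the layer sizes are simplifications, not HMTGIN's exact architecture):

```python
import torch
import torch.nn as nn

# Shared GIN-style encoder with one output layer per task; the skip
# connection mirrors the summary, everything else is a simplification.
class SharedGINMultiTask(nn.Module):
    def __init__(self, dim=64, n_tasks=3):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Task-specific output layers all read the shared embeddings.
        self.heads = nn.ModuleList(nn.Linear(dim, 1) for _ in range(n_tasks))

    def forward(self, h, adj):
        # GIN update h' = MLP((1 + eps) * h + sum_of_neighbors), plus skip.
        h = h + self.mlp((1 + self.eps) * h + adj @ h)
        return [head(h) for head in self.heads]

n = 6
h, adj = torch.randn(n, 64), (torch.rand(n, n) > 0.5).float()
print([o.shape for o in SharedGINMultiTask()(h, adj)])  # three (6, 1) outputs
```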
arXiv Detail & Related papers (2021-09-04T03:19:20Z)
- Low Resource Multi-Task Sequence Tagging -- Revisiting Dynamic Conditional Random Fields [67.51177964010967]
We compare different models for low resource multi-task sequence tagging that leverage dependencies between label sequences for different tasks.
We find that explicit modeling of inter-dependencies between task predictions outperforms single-task as well as standard multi-task models.
arXiv Detail & Related papers (2020-05-01T07:11:34Z)
- MTI-Net: Multi-Scale Task Interaction Networks for Multi-Task Learning [82.62433731378455]
We show that tasks with high affinity at a certain scale are not guaranteed to retain this behaviour at other scales.
We propose a novel architecture, namely MTI-Net, that builds upon this finding.
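A toy way to see the finding (stand-in random features, not MTI-Net itself): measure the affinity between two tasks' feature maps after pooling to several scales; the scores need not agree across scales.

```python
import numpy as np

# Affinity between two tasks' feature maps at several scales; with these
# stand-in features the scores can differ from scale to scale.
rng = np.random.default_rng(1)
feat_a = rng.normal(size=(32, 32))            # stand-in features, task A
feat_b = feat_a + rng.normal(size=(32, 32))   # task B: related but noisy

def avg_pool(f, k):
    """Average-pool a (H, W) map with kernel/stride k (H, W divisible by k)."""
    h, w = f.shape
    return f.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

for k in (1, 4, 16):  # fine -> coarse
    a, b = avg_pool(feat_a, k).ravel(), avg_pool(feat_b, k).ravel()
    affinity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    print(f"scale 1/{k}: cosine affinity = {affinity:.2f}")
```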
arXiv Detail & Related papers (2020-01-19T21:02:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.