Multimedia Generative Script Learning for Task Planning
- URL: http://arxiv.org/abs/2208.12306v3
- Date: Mon, 10 Jul 2023 16:51:34 GMT
- Title: Multimedia Generative Script Learning for Task Planning
- Authors: Qingyun Wang, Manling Li, Hou Pong Chan, Lifu Huang, Julia
Hockenmaier, Girish Chowdhary, Heng Ji
- Abstract summary: We propose a new task, Multimedia Generative Script Learning, to generate subsequent steps by tracking historical states in both text and vision modalities.
This task is challenging in three aspects: the multimedia challenge of capturing the visual states in images, the induction challenge of performing unseen tasks, and the diversity challenge of covering different information in individual steps.
Experiment results demonstrate that our approach significantly outperforms strong baselines.
- Score: 58.73725388387305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Goal-oriented generative script learning aims to generate subsequent steps to
reach a particular goal, which is an essential task to assist robots or humans
in performing stereotypical activities. An important aspect of this process is
the ability to capture historical states visually, which provides detailed
information that is not covered by text and will guide subsequent steps.
Therefore, we propose a new task, Multimedia Generative Script Learning, to
generate subsequent steps by tracking historical states in both text and vision
modalities, as well as presenting the first benchmark containing 5,652 tasks
and 79,089 multimedia steps. This task is challenging in three aspects: the
multimedia challenge of capturing the visual states in images, the induction
challenge of performing unseen tasks, and the diversity challenge of covering
different information in individual steps. We propose to encode visual state
changes through a selective multimedia encoder to address the multimedia
challenge, transfer knowledge from previously observed tasks using a
retrieval-augmented decoder to overcome the induction challenge, and further
present distinct information at each step by optimizing a diversity-oriented
contrastive learning objective. We define metrics to evaluate both generation
and inductive quality. Experiment results demonstrate that our approach
significantly outperforms strong baselines.
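To make the abstract's three components concrete, here is a minimal, hypothetical PyTorch sketch of how a selective multimedia encoder, a retrieval-augmented decoder, and a diversity-oriented contrastive objective could fit together. The gating form, transformer configuration, vocabulary size, and the repulsion-only loss surrogate are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the three components described above; shapes, gating,
# and the loss surrogate are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelectiveMultimediaEncoder(nn.Module):
    """Fuses step-text features with image features via a learned selection gate."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats, image_feats: (batch, steps, d_model)
        g = self.gate(torch.cat([text_feats, image_feats], dim=-1))
        return text_feats + g * image_feats  # keep only visual state info deemed relevant


class RetrievalAugmentedDecoder(nn.Module):
    """Generates the next step conditioned on the step history plus steps
    retrieved from previously observed tasks."""

    def __init__(self, d_model: int = 768, nhead: int = 8, num_layers: int = 2,
                 vocab_size: int = 50265):  # vocabulary size is an assumption
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tgt_embeds, history_memory, retrieved_memory):
        memory = torch.cat([history_memory, retrieved_memory], dim=1)
        return self.lm_head(self.decoder(tgt_embeds, memory))


def diversity_contrastive_loss(step_embeds: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Repulsion-only surrogate of a diversity-oriented contrastive objective:
    pushes embeddings of different steps of the same task apart so that each
    generated step carries distinct information."""
    z = F.normalize(step_embeds, dim=-1)              # (num_steps, d_model)
    sim = z @ z.t() / temperature                     # pairwise cosine similarities
    off_diag = ~torch.eye(len(z), dtype=torch.bool)   # ignore self-similarity
    return sim[off_diag].exp().mean().log()           # lower inter-step similarity -> lower loss


if __name__ == "__main__":
    B, S, D = 2, 4, 768
    encoder = SelectiveMultimediaEncoder(D)
    fused = encoder(torch.randn(B, S, D), torch.randn(B, S, D))
    decoder = RetrievalAugmentedDecoder(D)
    logits = decoder(torch.randn(B, 10, D), fused, torch.randn(B, 6, D))
    loss = diversity_contrastive_loss(torch.randn(5, D))
    print(logits.shape, loss.item())
```

In this sketch the gate mixes image features into the step-text representation dimension-wise, retrieved steps from previously observed tasks are simply concatenated into the decoder memory, and the loss only pushes step embeddings apart; the paper's actual selection, retrieval, and contrastive formulations may differ.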
Related papers
- MAML MOT: Multiple Object Tracking based on Meta-Learning [7.892321926673001]
We introduce MAML MOT, a meta-learning-based training approach for multi-object tracking.
arXiv Detail & Related papers (2024-05-12T12:38:40Z) - Read to Play (R2-Play): Decision Transformer with Multimodal Game Instruction [22.31940101833938]
This paper explores enhanced forms of task guidance for agents, enabling them to comprehend gameplay instructions.
We construct a set of multimodal game instructions to incorporate instruction tuning into a decision transformer.
arXiv Detail & Related papers (2024-02-06T17:09:25Z) - Masked Diffusion with Task-awareness for Procedure Planning in
Instructional Videos [16.93979476655776]
A key challenge with procedure planning in instructional videos is how to handle a large decision space consisting of a multitude of action types.
We introduce a simple yet effective enhancement - a masked diffusion model.
We learn a joint visual-text embedding, where a text embedding is generated by prompting a pre-trained vision-language model to focus on human actions.
arXiv Detail & Related papers (2023-09-14T03:25:37Z) - Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z) - Learning and Verification of Task Structure in Instructional Videos [85.511888642497]
We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos.
Compared to prior work, which learns step representations locally, our approach learns them globally.
We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order.
arXiv Detail & Related papers (2023-03-23T17:59:54Z) - Unsupervised Task Graph Generation from Instructional Video Transcripts [53.54435048879365]
We consider a setting where text transcripts of instructional videos demonstrating a real-world activity are provided.
The goal is to identify the key steps relevant to the task as well as the dependency relationship between these key steps.
We propose a novel task graph generation approach that combines the reasoning capabilities of instruction-tuned language models along with clustering and ranking components.
arXiv Detail & Related papers (2023-02-17T22:50:08Z) - Variational Multi-Task Learning with Gumbel-Softmax Priors [105.22406384964144]
Multi-task learning aims to exploit task relatedness to improve performance on individual tasks.
We propose variational multi-task learning (VMTL), a general probabilistic inference framework for learning multiple related tasks.
arXiv Detail & Related papers (2021-11-09T18:49:45Z) - Episodic Transformer for Vision-and-Language Navigation [142.6236659368177]
This paper focuses on addressing two challenges: handling long sequences of subtasks, and understanding complex human instructions.
We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.
Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
arXiv Detail & Related papers (2021-05-13T17:51:46Z) - Visual Goal-Step Inference using wikiHow [29.901908251322684]
Inferring the sub-sequence of steps of a goal can help artificial intelligence systems reason about human activities.
We propose the Visual Goal-Step Inference (VGSI) task where a model is given a textual goal and must choose a plausible step towards that goal from among four candidate images.
We show that the knowledge learned from our data can effectively transfer to other datasets like HowTo100M, increasing the multiple-choice accuracy by 15% to 20%.
arXiv Detail & Related papers (2021-04-12T22:20:09Z)
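To illustrate the VGSI multiple-choice format described above, the following is a hedged sketch that scores four candidate step images against a textual goal with an off-the-shelf CLIP model from Hugging Face; the CLIP checkpoint, the goal string, and the placeholder images are assumptions for demonstration, not the models or data used in the paper.

```python
# Hypothetical illustration of a VGSI-style multiple choice: pick the image that is
# the most plausible step toward a textual goal. The CLIP checkpoint, the goal, and
# the placeholder images are assumptions, not the paper's models or data.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

goal = "make a paper airplane"  # textual goal
# Placeholder images; in practice these would be the four candidate step images.
candidates = [Image.new("RGB", (224, 224), color=(60 * i, 100, 150)) for i in range(4)]

inputs = processor(text=[goal], images=candidates, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_text has shape (1, 4): similarity of the goal against each candidate image.
best = out.logits_per_text.softmax(dim=-1).argmax(dim=-1).item()
print(f"Most plausible step image: candidate {best}")
```

A model trained on the wikiHow-derived VGSI data would replace the zero-shot CLIP scorer here; the multiple-choice scoring interface stays the same.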