MULTISCRIPT: Multimodal Script Learning for Supporting Open Domain
Everyday Tasks
- URL: http://arxiv.org/abs/2310.04965v2
- Date: Thu, 18 Jan 2024 21:17:04 GMT
- Title: MULTISCRIPT: Multimodal Script Learning for Supporting Open Domain
Everyday Tasks
- Authors: Jingyuan Qi, Minqian Liu, Ying Shen, Zhiyang Xu, Lifu Huang
- Abstract summary: We present a new benchmark challenge -- MultiScript.
For both tasks, the input consists of a target task name and a video illustrating what has been done to complete the target task.
The expected output is (1) a sequence of structured step descriptions in text based on the demonstration video, and (2) a single text description for the subsequent step.
- Score: 28.27986773292919
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatically generating scripts (i.e., sequences of key steps described in
text) from video demonstrations and reasoning about subsequent steps are
crucial for modern AI virtual assistants to guide humans through
everyday tasks, especially unfamiliar ones. However, current methods for
generative script learning rely heavily on well-structured preceding steps
described in text and/or images or are limited to a certain domain, resulting
in a disparity with real-world user scenarios. To address these limitations, we
present a new benchmark challenge -- MultiScript, with two new tasks on
task-oriented multimodal script learning: (1) multimodal script generation, and
(2) subsequent step prediction. For both tasks, the input consists of a target
task name and a video illustrating what has been done to complete the target
task, and the expected output is (1) a sequence of structured step descriptions
in text based on the demonstration video, and (2) a single text description for
the subsequent step, respectively. Built from WikiHow, MultiScript covers
multimodal scripts in videos and text descriptions for over 6,655 human
everyday tasks across 19 diverse domains. To establish baseline performance on
MultiScript, we propose two knowledge-guided multimodal generative frameworks
that incorporate the task-related knowledge prompted from large language models
such as Vicuna. Experimental results show that our proposed approaches
significantly improve over the competitive baselines.
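As a rough illustration of the task setup described in the abstract, below is a minimal Python sketch of the two MultiScript task interfaces and of a knowledge-elicitation prompt in the spirit of the knowledge-guided baselines. The class names, field names, and the build_knowledge_prompt helper are hypothetical and are not taken from the paper's released code; only the input/output structure follows the abstract.

```python
# Minimal sketch (assumed structure, not the authors' implementation) of the
# MultiScript task interfaces: input is a task name plus a demonstration video,
# output is either a full step script or a single subsequent step.
from dataclasses import dataclass
from typing import List


@dataclass
class MultiScriptInput:
    task_name: str        # e.g., "Replace a bicycle tire" (hypothetical example)
    demo_video_path: str  # video showing what has been done so far


@dataclass
class ScriptGenerationOutput:
    steps: List[str]      # structured step descriptions grounded in the video


@dataclass
class NextStepOutput:
    next_step: str        # single text description of the subsequent step


def build_knowledge_prompt(task_name: str) -> str:
    """Hypothetical prompt for eliciting task-related knowledge from an LLM
    such as Vicuna, loosely following the knowledge-guided framing above."""
    return (
        f"List the typical steps a person follows to complete the task: {task_name}. "
        "Answer with one short step per line."
    )
```

In this sketch, multimodal script generation maps a MultiScriptInput to a ScriptGenerationOutput, while subsequent step prediction maps the same input to a NextStepOutput; the elicited knowledge string would be concatenated with the video features before generation.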
Related papers
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z)
- Multitask Multimodal Prompted Training for Interactive Embodied Task Completion [48.69347134411864]
Embodied MultiModal Agent (EMMA) is a unified encoder-decoder model that reasons over images and trajectories.
By unifying all tasks as text generation, EMMA learns a language of actions which facilitates transfer across tasks.
arXiv Detail & Related papers (2023-11-07T15:27:52Z)
- TransPrompt v2: A Transferable Prompting Framework for Cross-task Text Classification [37.824031151922604]
We propose TransPrompt v2, a novel transferable prompting framework for few-shot learning across similar or distant text classification tasks.
For learning across similar tasks, we employ a multi-task meta-knowledge acquisition (MMA) procedure to train a meta-learner.
For learning across distant tasks, we inject the task type descriptions into the prompt, and capture the intra-type and inter-type prompt embeddings.
arXiv Detail & Related papers (2023-08-29T04:16:57Z)
- Non-Sequential Graph Script Induction via Multimedia Grounding [129.83134296316493]
We train a script knowledge model capable of both generating explicit graph scripts for learnt tasks and predicting future steps given a partial step sequence.
Human evaluation shows our model outperforming the WikiHow linear baseline with 48.76% absolute gains in capturing sequential and non-sequential step relationships.
arXiv Detail & Related papers (2023-05-27T18:13:17Z)
- Multimedia Generative Script Learning for Task Planning [58.73725388387305]
We propose a new task, Multimedia Generative Script Learning, to generate subsequent steps by tracking historical states in both text and vision modalities.
This task is challenging in three aspects: the multimedia challenge of capturing the visual states in images, the induction challenge of performing unseen tasks, and the diversity challenge of covering different information in individual steps.
Experiment results demonstrate that our approach significantly outperforms strong baselines.
arXiv Detail & Related papers (2022-08-25T19:04:28Z)
- Goal-Oriented Script Construction [23.6227797113877]
We propose the Goal-Oriented Script Construction task, where a model produces a sequence of steps to accomplish a given goal.
We pilot our task on the first multilingual script learning dataset supporting 18 languages collected from wikiHow.
arXiv Detail & Related papers (2021-07-28T06:39:31Z)
- VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding [78.28397557433544]
We present a task-agnostic multi-modal pre-training approach that can accept either video or text input, or both for a variety of end tasks.
Experimental results show strong performance across a wider range of tasks than any previous methods, often outperforming task-specific pre-training.
arXiv Detail & Related papers (2021-05-20T19:13:27Z)
- proScript: Partially Ordered Scripts Generation via Pre-trained Language Models [49.03193243699244]
We demonstrate for the first time that pre-trained neural language models (LMs) can be finetuned to generate high-quality scripts.
We collected a large (6.4k) crowdsourced dataset of partially ordered scripts (named proScript).
Our experiments show that our models perform well (e.g., F1=75.7 in task (i)), illustrating a new approach to overcoming previous barriers to script collection.
arXiv Detail & Related papers (2021-04-16T17:35:10Z)