Understanding Multimodal Procedural Knowledge by Sequencing Multimodal
Instructional Manuals
- URL: http://arxiv.org/abs/2110.08486v4
- Date: Tue, 20 Feb 2024 22:22:00 GMT
- Title: Understanding Multimodal Procedural Knowledge by Sequencing Multimodal
Instructional Manuals
- Authors: Te-Lin Wu, Alex Spangher, Pegah Alipoormolabashi, Marjorie Freedman,
Ralph Weischedel, Nanyun Peng
- Abstract summary: We benchmark machine learning models' capability of reasoning over and sequencing unordered multimodal instructions.
We find models not only perform significantly worse than humans but also seem incapable of efficiently utilizing the multimodal information.
We propose sequentiality-aware pretraining techniques that exploit the sequential alignment properties of both texts and images.
- Score: 48.55362590292391
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to sequence unordered events is an essential skill to comprehend
and reason about real world task procedures, which often requires thorough
understanding of temporal common sense and multimodal information, as these
procedures are often communicated through a combination of texts and images.
Such capability is essential for applications such as sequential task planning
and multi-source instruction summarization. While humans are capable of
reasoning about and sequencing unordered multimodal procedural instructions,
whether current machine learning models have such essential capability is still
an open question. In this work, we benchmark models' capability of reasoning
over and sequencing unordered multimodal instructions by curating datasets from
popular online instructional manuals and collecting comprehensive human
annotations. We find models not only perform significantly worse than humans
but also seem incapable of efficiently utilizing the multimodal information. To
improve machines' performance on multimodal event sequencing, we propose
sequentiality-aware pretraining techniques that exploit the sequential
alignment properties of both texts and images, resulting in significant improvements of more than 5%.
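To make the sequencing task concrete, the following is a minimal, illustrative sketch (in Python) of how a predicted ordering of shuffled instruction steps can be scored against the gold order. The metric choices (pairwise ordering accuracy and Kendall's tau) and all function names are assumptions for illustration, not the paper's released evaluation code.

```python
# Illustrative sketch: scoring a reconstructed step order against the gold order.
# Pairwise accuracy and Kendall's tau are assumed metrics, not necessarily the
# benchmark's exact protocol.
from itertools import combinations
from typing import Sequence


def pairwise_accuracy(pred_order: Sequence[int], gold_order: Sequence[int]) -> float:
    """Fraction of step pairs whose relative order matches the gold sequence
    (assumes at least two steps and identical step sets in both orders)."""
    gold_rank = {step: i for i, step in enumerate(gold_order)}
    pred_rank = {step: i for i, step in enumerate(pred_order)}
    pairs = list(combinations(gold_order, 2))
    correct = sum(
        (gold_rank[a] < gold_rank[b]) == (pred_rank[a] < pred_rank[b])
        for a, b in pairs
    )
    return correct / len(pairs)


def kendall_tau(pred_order: Sequence[int], gold_order: Sequence[int]) -> float:
    """Kendall's tau without ties: 1.0 for identical orderings, -1.0 for a full reversal."""
    return 2.0 * pairwise_accuracy(pred_order, gold_order) - 1.0


if __name__ == "__main__":
    gold = [0, 1, 2, 3, 4]       # ground-truth step order from the manual
    predicted = [0, 2, 1, 3, 4]  # a model's reconstruction of the shuffled steps
    print(f"pairwise accuracy: {pairwise_accuracy(predicted, gold):.2f}")  # 0.90
    print(f"kendall tau:       {kendall_tau(predicted, gold):.2f}")        # 0.80
```

Under order-sensitive metrics like these, a model that ignores the image channel gains nothing from multimodal inputs, which is the kind of gap the benchmark is designed to expose.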
Related papers
- From Symbolic Tasks to Code Generation: Diversification Yields Better Task Performers [1.6958018695660049]
We show that a more diverse instruction set, extending beyond code-related tasks, improves the performance of code generation.
Our observations suggest that a more diverse semantic space for instruction-tuning sets greatly improves the model's ability to follow instructions and perform tasks.
arXiv Detail & Related papers (2024-05-30T07:54:07Z)
- Fine-tuning Large Language Models with Sequential Instructions [2.546845645875049]
We find that existing instruction-tuned models struggle to respond to queries with multiple instructions.
We contend that part of the fine-tuning data mixture should be sequential, containing a chain of interrelated tasks.
We automate this process by turning instructions in existing datasets into diverse and complex sequential instructions.
Models that underwent our sequential instruction tuning show improved results in coding, maths, and open-ended generation.
arXiv Detail & Related papers (2024-03-12T16:33:30Z)
- Towards Robust Instruction Tuning on Multimodal Large Language Models [25.506776502317436]
In this work, we introduce an automatic instruction augmentation method named INSTRAUG in multimodal tasks.
Results on two popular multimodal instruction-following benchmarks show that INSTRAUG can significantly improve the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks.
arXiv Detail & Related papers (2024-02-22T12:35:50Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, including emergent abilities to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z)
- Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z)
- Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding [52.723297744257536]
Pre-trained language models (LMs) have shown effectiveness in scientific literature understanding tasks.
We propose a multi-task contrastive learning framework, SciMult, to facilitate common knowledge sharing across different literature understanding tasks.
arXiv Detail & Related papers (2023-05-23T16:47:22Z)
- Multi-Modal Experience Inspired AI Creation [33.34566822058209]
We study how to generate texts based on sequential multi-modal information.
We first design a multi-channel sequence-to-sequence architecture equipped with a multi-modal attention network.
We then propose a curriculum negative sampling strategy tailored for the sequential inputs.
arXiv Detail & Related papers (2022-09-02T11:50:41Z)
- Temporally Correlated Task Scheduling for Sequence Learning [143.70523777803723]
In many applications, a sequence learning task is usually associated with multiple temporally correlated auxiliary tasks.
We introduce a learnable scheduler to sequence learning, which can adaptively select auxiliary tasks for training.
Our method significantly improves the performance of simultaneous machine translation and stock trend forecasting.
arXiv Detail & Related papers (2020-07-10T10:28:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.