Show and Guide: Instructional-Plan Grounded Vision and Language Model
- URL: http://arxiv.org/abs/2409.19074v3
- Date: Fri, 18 Oct 2024 23:27:44 GMT
- Title: Show and Guide: Instructional-Plan Grounded Vision and Language Model
- Authors: Diogo Glória-Silva, David Semedo, João Magalhães
- Abstract summary: We present MM-PlanLLM, the first multimodal plan-following language model.
We bring cross-modality through two key tasks: Conversational Video Moment Retrieval and Visually-Informed Step Generation.
MM-PlanLLM is trained using a novel multitask-multistage approach.
- Score: 9.84151565227816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Guiding users through complex procedural plans is an inherently multimodal task in which visually illustrated plan steps are crucial to delivering effective plan guidance. However, existing plan-following language models (LMs) are often not capable of multimodal input and output. In this work, we present MM-PlanLLM, the first multimodal LLM designed to assist users in executing instructional tasks by leveraging both textual plans and visual information. Specifically, we bring cross-modality through two key tasks: Conversational Video Moment Retrieval, where the model retrieves relevant step-video segments based on user queries, and Visually-Informed Step Generation, where the model generates the next step in a plan, conditioned on an image of the user's current progress. MM-PlanLLM is trained using a novel multitask-multistage approach, designed to gradually expose the model to the semantic layers of multimodal instructional plans, achieving strong performance on both multimodal and textual dialogue in a plan-grounded setting. Furthermore, we show that the model delivers cross-modal temporal and plan-structure representations aligned between textual plan steps and instructional video moments.
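To make the two cross-modal tasks concrete, the minimal Python sketch below mocks their input/output contract with placeholder embeddings and cosine similarity. All names (VideoSegment, retrieve_step_moment, generate_next_step) are hypothetical illustrations and do not correspond to the authors' released code or API.
```python
# Illustrative sketch only: names and logic are hypothetical, not MM-PlanLLM's implementation.
from dataclasses import dataclass
import numpy as np

@dataclass
class VideoSegment:
    step_id: int           # index of the plan step this clip illustrates
    start_s: float         # segment start time (seconds)
    end_s: float           # segment end time (seconds)
    embedding: np.ndarray  # placeholder clip embedding in a shared text-video space

def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_step_moment(query_emb: np.ndarray,
                         segments: list[VideoSegment]) -> VideoSegment:
    """Conversational Video Moment Retrieval (toy version): return the
    step-video segment whose embedding is closest to the user-query
    embedding under cosine similarity."""
    return max(segments, key=lambda seg: _cosine(query_emb, seg.embedding))

def generate_next_step(plan_steps: list[str],
                       progress_image_emb: np.ndarray,
                       step_image_embs: list[np.ndarray]) -> str:
    """Visually-Informed Step Generation (toy version): estimate the user's
    current step by matching the progress image against reference step
    images, then return the following step of the textual plan."""
    sims = [_cosine(progress_image_emb, e) for e in step_image_embs]
    current = int(np.argmax(sims))
    return plan_steps[min(current + 1, len(plan_steps) - 1)]
```
In the paper itself, both operations are handled end-to-end by a single multimodal LLM conditioned on textual plans and visual inputs, rather than by separate nearest-neighbour lookups; the sketch only illustrates what each task takes in and returns.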
Related papers
- Learning Task Planning from Multi-Modal Demonstration for Multi-Stage Contact-Rich Manipulation [26.540648608911308]
In this paper, we introduce an in-context learning framework that incorporates tactile and force-torque information from human demonstrations.
We propose a bootstrapped reasoning pipeline that sequentially integrates each modality into a comprehensive task plan.
This task plan is then used as a reference for planning in new task configurations.
arXiv Detail & Related papers (2024-09-18T10:36:47Z)
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- NoteLLM-2: Multimodal Large Representation Models for Recommendation [60.17448025069594]
We investigate the potential of Large Language Models to enhance multimodal representation in multimodal item-to-item recommendations.
One feasible method is the transfer of Multimodal Large Language Models (MLLMs) for representation tasks.
We propose a novel training framework, NoteLLM-2, specifically designed for multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
- mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning [8.1113308714581]
This paper introduces a novel multimodal chart question-answering model.
Our model integrates visual and linguistic processing, overcoming the constraints of existing methods.
This approach has demonstrated superior performance on multiple public datasets.
arXiv Detail & Related papers (2024-04-02T01:28:44Z)
- Multitask Multimodal Prompted Training for Interactive Embodied Task Completion [48.69347134411864]
Embodied MultiModal Agent (EMMA) is a unified encoder-decoder model that reasons over images and trajectories.
By unifying all tasks as text generation, EMMA learns a language of actions which facilitates transfer across tasks.
arXiv Detail & Related papers (2023-11-07T15:27:52Z)
- Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z)
- Compositional Foundation Models for Hierarchical Planning [52.18904315515153]
We propose a foundation model that leverages expert foundation models, each trained individually on language, vision, and action data, and combines them to solve long-horizon tasks.
We use a large language model to construct symbolic plans that are grounded in the environment through a large video diffusion model.
Generated video plans are then grounded to visual-motor control, through an inverse dynamics model that infers actions from generated videos.
arXiv Detail & Related papers (2023-09-15T17:44:05Z)
- Multimodal Procedural Planning via Dual Text-Image Prompting [78.73875275944711]
Embodied agents have achieved strong performance in following human instructions to complete tasks.
We present the multimodal procedural planning task, in which models are given a high-level goal and generate plans of paired text-image steps.
Key challenges of multimodal procedural planning (MPP) are ensuring the informativeness, temporal coherence, and accuracy of plans across modalities.
arXiv Detail & Related papers (2023-05-02T21:46:44Z)
- MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment [24.720485548282845]
We introduce concepts in both modalities to construct two-level semantic representations for language and vision.
We train the cross-modality model in two stages, namely, uni-modal learning and cross-modal learning.
Our model achieves state-of-the-art results on several vision and language tasks.
arXiv Detail & Related papers (2022-01-29T14:30:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.