Multimodal Procedural Planning via Dual Text-Image Prompting
- URL: http://arxiv.org/abs/2305.01795v1
- Date: Tue, 2 May 2023 21:46:44 GMT
- Title: Multimodal Procedural Planning via Dual Text-Image Prompting
- Authors: Yujie Lu, Pan Lu, Zhiyu Chen, Wanrong Zhu, Xin Eric Wang, William Yang Wang
- Abstract summary: Embodied agents have achieved prominent performance in following human instructions to complete tasks.
We present the multimodal procedural planning task, in which models are given a high-level goal and generate plans of paired text-image steps.
Key challenges of MPP are to ensure the informativeness, temporal coherence, and accuracy of plans across modalities.
- Score: 78.73875275944711
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Embodied agents have achieved prominent performance in following human
instructions to complete tasks. However, the potential of providing
instructions informed by texts and images to assist humans in completing tasks
remains underexplored. To uncover this capability, we present the multimodal
procedural planning (MPP) task, in which models are given a high-level goal and
generate plans of paired text-image steps, providing more complementary and
informative guidance than unimodal plans. The key challenges of MPP are to
ensure the informativeness, temporal coherence, and accuracy of plans across
modalities. To tackle this, we propose Text-Image Prompting (TIP), a
dual-modality prompting method that jointly leverages zero-shot reasoning
ability in large language models (LLMs) and compelling text-to-image generation
ability of diffusion-based models. TIP improves the interaction between the two modalities through a Text-to-Image Bridge and an Image-to-Text Bridge: the LLM guides text-grounded image plan generation, and descriptions of the generated image plans in turn ground the textual plan. To address
the lack of relevant datasets, we collect WIKIPLAN and RECIPEPLAN as a testbed
for MPP. On WIKIPLAN and RECIPEPLAN, TIP compares favorably with unimodal and multimodal baselines in both human preference and automatic scores for informativeness, temporal coherence, and plan accuracy. Our code and data:
https://github.com/YujieLu10/MPP.
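To make the dual-bridge interaction concrete, below is a minimal sketch of a TIP-style loop. It is an illustration under stated assumptions, not the paper's implementation: the callables llm, text_to_image, and caption_image are hypothetical stand-ins for an LLM, a diffusion model, and an image captioner, and the prompt templates are invented for the example.
```python
"""Minimal sketch of a TIP-style dual-bridge loop (illustrative only)."""
from typing import Callable, List, Tuple


def tip_plan(
    goal: str,
    llm: Callable[[str], str],              # text prompt -> text
    text_to_image: Callable[[str], bytes],  # image prompt -> image bytes
    caption_image: Callable[[bytes], str],  # image bytes -> description
    num_steps: int = 5,
) -> List[Tuple[str, bytes]]:
    """Return (text step, image) pairs for a high-level goal."""
    plan: List[Tuple[str, bytes]] = []
    for i in range(1, num_steps + 1):
        # Zero-shot textual step from the LLM.
        step = llm(f"Goal: {goal}\nWrite step {i} of the plan in one sentence:")

        # Text-to-Image Bridge: the LLM rewrites the step as a visually
        # concrete prompt, which conditions the diffusion model.
        image_prompt = llm(
            f"Rewrite this step as a concrete visual scene description "
            f"for an image generator: {step}"
        )
        image = text_to_image(image_prompt)

        # Image-to-Text Bridge: describe the generated image and let the
        # LLM revise the step so that text and image stay consistent.
        description = caption_image(image)
        step = llm(
            f"Step: {step}\nThe image shows: {description}\n"
            f"Revise the step so it matches the image and the goal '{goal}':"
        )
        plan.append((step, image))
    return plan


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without any real models.
    demo = tip_plan(
        goal="bake a loaf of sourdough bread",
        llm=lambda prompt: prompt.splitlines()[-1][:60],
        text_to_image=lambda prompt: prompt.encode("utf-8"),
        caption_image=lambda image: image.decode("utf-8"),
        num_steps=3,
    )
    for text, image in demo:
        print(text, f"({len(image)}-byte image)")
```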
Related papers
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples, tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- Show and Guide: Instructional-Plan Grounded Vision and Language Model [9.84151565227816]
We present MM-PlanLLM, the first multimodal plan-following language model.
We enable cross-modality through two key tasks: Conversational Video Moment Retrieval and Visually-Informed Step Generation.
MM-PlanLLM is trained using a novel multitask-multistage approach.
arXiv Detail & Related papers (2024-09-27T18:20:24Z)
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
- TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training [5.239585892767183]
We propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training.
Our proposed method outperforms existing state-of-the-art methods, both on text-guided generation tasks and manipulation tasks.
arXiv Detail & Related papers (2023-09-21T09:34:20Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [63.84720380390935]
There exist two typical types of medical vision-and-language pre-training models, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- Visual Grounding Strategies for Text-Only Natural Language Processing [1.2183405753834562]
Multimodal extensions of BERT allow joint modeling of texts and images, leading to state-of-the-art results on multimodal tasks such as Visual Question Answering.
Here, we leverage multimodal modeling for purely textual tasks with the expectation that the multimodal pretraining provides a grounding that can improve text processing accuracy.
A first type of strategy, referred to as transferred grounding, consists in applying multimodal models to text-only tasks using a placeholder to replace the image input. The second, called associative grounding, harnesses image retrieval to match texts with related images; a minimal sketch of both strategies appears after this list.
arXiv Detail & Related papers (2021-03-25T16:03:00Z)
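Picking up the last entry above, the following is a minimal sketch of what the two grounding strategies could look like in code, assuming a generic multimodal encoder interface. MultimodalEncoder, the placeholder image, and retrieve_image are illustrative assumptions rather than that paper's actual components.
```python
"""Sketch of transferred vs. associative grounding, under assumed interfaces."""
from typing import Callable, Optional


class MultimodalEncoder:
    """Toy stand-in; a real multimodal BERT would return embeddings."""

    def encode(self, text: str, image: Optional[bytes]) -> str:
        tag = "no image" if image is None else f"{len(image)}-byte image"
        return f"encoded({text!r} with {tag})"


# Constant dummy input fed wherever an image would normally go.
PLACEHOLDER_IMAGE = b"\x00" * 16


def transferred_grounding(model: MultimodalEncoder, text: str) -> str:
    # Reuse the multimodal model on a text-only task by replacing the
    # image input with a fixed placeholder.
    return model.encode(text, PLACEHOLDER_IMAGE)


def associative_grounding(
    model: MultimodalEncoder,
    text: str,
    retrieve_image: Callable[[str], bytes],
) -> str:
    # Retrieve an image related to the text and feed the (text, image)
    # pair to the model.
    return model.encode(text, retrieve_image(text))


if __name__ == "__main__":
    encoder = MultimodalEncoder()
    print(transferred_grounding(encoder, "a cat sits on a mat"))
    print(associative_grounding(encoder, "a cat sits on a mat",
                                retrieve_image=lambda t: t.encode("utf-8")))
```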
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.