LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation,
Generation and Editing
- URL: http://arxiv.org/abs/2311.00571v1
- Date: Wed, 1 Nov 2023 15:13:43 GMT
- Title: LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation,
Generation and Editing
- Authors: Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, Jianfeng Gao, Chunyuan
Li
- Abstract summary: The system can have multi-turn dialogues with human users by taking multimodal user inputs and generating multimodal responses.
LLaVA-Interactive goes beyond language prompts: visual prompts are enabled to better align with human intent in the interaction.
- Score: 99.80742991922992
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLaVA-Interactive is a research prototype for multimodal human-AI
interaction. The system can hold multi-turn dialogues with human users by
taking multimodal user inputs and generating multimodal responses. Importantly,
LLaVA-Interactive goes beyond language prompts: visual prompts are enabled to
better align with human intent in the interaction. The development of
LLaVA-Interactive is extremely cost-efficient, as the system combines three
multimodal skills from pre-built AI models without any additional model
training: visual chat from LLaVA, image segmentation from SEEM, and image
generation and editing from GLIGEN. A diverse set of application scenarios is
presented to demonstrate the promise of LLaVA-Interactive and to inspire
future research on multimodal interactive systems.
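The abstract describes a training-free composition: each user turn is routed to one of three pre-built skills (visual chat, interactive segmentation, grounded generation/editing). Below is a minimal sketch of how such a demo loop could be wired up. It is not the authors' implementation; the class names (LLaVAInteractiveDemo, Turn), the component interfaces, and the simple keyword-based router are all assumptions made for illustration.

```python
# Minimal sketch of an all-in-one multimodal demo loop, assuming three
# pre-built components analogous to LLaVA (visual chat), SEEM (segmentation),
# and GLIGEN (grounded generation/editing). All class and method names here
# are hypothetical placeholders, not the actual project APIs.
from dataclasses import dataclass


@dataclass
class Turn:
    text: str                      # language prompt from the user
    image: object = None           # current working image, if any
    visual_prompt: object = None   # strokes/boxes/masks drawn by the user


class LLaVAInteractiveDemo:
    def __init__(self, chat_model, segmenter, editor):
        # Each component is used as-is, with no additional training.
        self.chat_model = chat_model   # visual chat (LLaVA-style model)
        self.segmenter = segmenter     # interactive segmentation (SEEM-style)
        self.editor = editor           # grounded generation/editing (GLIGEN-style)
        self.history = []              # multi-turn dialogue state

    def step(self, turn: Turn):
        """Route one user turn to the appropriate skill and return a
        multimodal response (text and/or an updated image)."""
        self.history.append(turn)
        text = turn.text.lower()
        if turn.visual_prompt is not None and "segment" in text:
            # A visual prompt (e.g., a stroke) selects the region to segment.
            mask = self.segmenter.segment(turn.image, turn.visual_prompt)
            return {"text": "Here is the segmented region.", "mask": mask}
        if any(k in text for k in ("edit", "remove", "replace", "generate")):
            # Grounded editing/generation conditioned on masks or boxes.
            new_image = self.editor.edit(turn.image, turn.text, turn.visual_prompt)
            return {"text": "Here is the edited image.", "image": new_image}
        # Default: multi-turn visual chat grounded in the current image.
        reply = self.chat_model.chat(self.history, image=turn.image)
        return {"text": reply}
```

In this sketch the dialogue state, the working image, and any visual prompt are carried through every turn, so that chat, segmentation, and editing all operate on the same evolving context rather than on isolated requests.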
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, reflecting their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models [50.259006481656094]
We present a novel interactive application aimed at understanding the internal mechanisms of large vision-language models.
Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer.
We present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.
arXiv Detail & Related papers (2024-04-03T23:57:34Z) - MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments [82.67236400004826]
We introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.
Its MEM module enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities.
arXiv Detail & Related papers (2024-02-01T02:43:20Z) - LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model [20.209674713676872]
We introduce LLaVA-φ (LLaVA-Phi), an efficient multi-modal assistant.
LLaVA-Phi harnesses the power of the recently advanced small language model, Phi-2.
arXiv Detail & Related papers (2024-01-04T16:07:43Z) - Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z) - Expanding Frozen Vision-Language Models without Retraining: Towards
Improved Robot Perception [0.0]
Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks.
In this paper, we demonstrate a method of aligning the embedding spaces of different modalities to the vision embedding space.
We show that using multiple modalities as input improves the VLM's scene understanding and enhances its overall performance in various tasks.
arXiv Detail & Related papers (2023-08-31T06:53:55Z) - ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring
Instruction Tuning [24.87615615489849]
We present precise referring instructions that utilize diverse reference representations, such as points and boxes, as referring prompts to refer to specific regions.
We propose ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity including mouse clicks, drag-and-drop, and drawing boxes.
arXiv Detail & Related papers (2023-07-18T17:56:06Z) - Chat with the Environment: Interactive Multimodal Perception Using Large
Language Models [19.623070762485494]
Large Language Models (LLMs) have shown remarkable reasoning ability in few-shot robotic planning.
Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behavior in a multimodal environment.
arXiv Detail & Related papers (2023-03-14T23:01:27Z) - PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train the encodings of these sensor modalities end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.