LMEye: An Interactive Perception Network for Large Language Models
- URL: http://arxiv.org/abs/2305.03701v6
- Date: Thu, 28 Sep 2023 08:18:43 GMT
- Title: LMEye: An Interactive Perception Network for Large Language Models
- Authors: Yunxin Li, Baotian Hu, Xinyu Chen, Lin Ma, Yong Xu, and Min Zhang
- Abstract summary: LMEye is a plug-and-play interactive perception network that acts as a human-like eye for Large Language Models.
It enables dynamic interaction between Large Language Models and external visual information.
It significantly improves zero-shot performance on various multimodal tasks.
- Score: 43.160353427015025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training a Multimodal Large Language Model (MLLM) from scratch, like GPT-4,
is resource-intensive. Treating Large Language Models (LLMs) as the core
processor for multimodal information, our paper introduces LMEye, a human-like
eye in the form of a plug-and-play interactive perception network designed to enable
dynamic interaction between LLMs and external visual information. Previous
methods incorporate visual information into LLMs through a simple visual mapping
network or the Q-Former from BLIP-2. Such networks project the image features once
and do not model the interaction between the image and the human input query.
Hence, the obtained visual information, which is not conditioned on human
intention, may be inadequate for LLMs to generate intention-following responses;
we refer to this as static visual information. LMEye addresses this issue by
allowing the LLM to request the desired visual information aligned with various
human instructions, which we term dynamic visual information
interaction. Specifically, LMEye consists of a simple visual mapping network that
provides the basic perception of an image for the LLM, together with additional
modules responsible for acquiring requests from the LLM, performing request-based
visual information interaction, and transmitting the resulting
visual information back to the LLM. In this way, the LLM understands
the human query, delivers the corresponding request to the request-based visual
information interaction module, and generates its response based on the
interleaved multimodal information. We evaluate LMEye through extensive
experiments on multiple multimodal benchmarks, demonstrating that it significantly
improves zero-shot performance on various multimodal tasks compared to
previous methods, while using fewer parameters.
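To make the contrast between static and dynamic visual information concrete, the following PyTorch-style sketch shows one plausible way the two components described in the abstract could be wired up. All class names, tensor shapes, and the choice of cross-attention are assumptions made for illustration; they are not taken from the authors' released code.

```python
# Minimal, hypothetical sketch of the two kinds of visual information the
# abstract contrasts: a one-shot "static" projection of image features, and a
# request-based interaction module that conditions the returned visual
# information on a request emitted by the LLM. Names and design choices are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class StaticVisualMapping(nn.Module):
    """Projects image features into the LLM embedding space once,
    independent of the human query ("static visual information")."""

    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vis_dim)
        return self.proj(image_feats)  # (batch, num_patches, llm_dim)


class RequestBasedInteraction(nn.Module):
    """Lets LLM-produced request tokens attend over the image features,
    so the visual information handed back depends on the instruction."""

    def __init__(self, llm_dim: int, vis_dim: int, num_heads: int = 8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, llm_dim)
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(llm_dim, llm_dim)

    def forward(self, request: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # request: (batch, num_request_tokens, llm_dim), taken from the LLM's
        # hidden states after it has read the query and the static image tokens.
        keys = self.vis_proj(image_feats)               # (batch, num_patches, llm_dim)
        attended, _ = self.cross_attn(request, keys, keys)
        return self.out_proj(attended)                  # appended to the LLM input as extra tokens
```

Under this reading, the LLM first sees the query together with the statically mapped image tokens, emits one or more request tokens, and then continues generation after receiving the output of the interaction module as additional input tokens.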
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- Visual Prompting in Multimodal Large Language Models: A Survey [95.75225825537528]
Multimodal large language models (MLLMs) equip pre-trained large language models (LLMs) with visual capabilities.
Visual prompting has emerged as a way to provide more fine-grained and free-form visual instructions.
This survey covers visual prompting, prompt generation, compositional reasoning, and prompt learning.
arXiv Detail & Related papers (2024-09-05T08:47:34Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty of explicitly conveying fine-grained or spatially dense information, such as masks, in text poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- From Image to Video, what do we need in multimodal LLMs? [19.85928004619801]
Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in understanding multimodal information.
We propose RED-VILLM, a Resource-Efficient Development pipeline for Video LLMs from Image LLMs.
Our approach highlights the potential for a more cost-effective and scalable advancement in multimodal models.
arXiv Detail & Related papers (2024-04-18T02:43:37Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution; a minimal illustrative sketch of this crop-and-refocus idea follows this list.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs [34.211455081027964]
V* is a visual search mechanism that employs the world knowledge in LLMs for efficient visual querying.
Our study highlights the necessity of incorporating visual search capabilities into multimodal systems.
arXiv Detail & Related papers (2023-12-21T18:55:06Z)
- InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
Multimodal large language models (MLLMs) have attracted growing interest.
This work delves into enabling LLMs to tackle more vision-language-related tasks.
InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv Detail & Related papers (2023-11-12T09:58:16Z)
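As referenced in the Chain-of-Spot entry above, here is a minimal, hypothetical sketch of a crop-and-refocus step: crop the region of interest from the full-resolution image, encode both the global view and the crop, and pass both token sequences to the vision-language model. The helper names, the normalized bounding-box format, and the placeholder vision_encoder are assumptions for illustration, not the CoS authors' implementation.

```python
# Hypothetical illustration of a Chain-of-Spot-style "focus on a region of
# interest" step. Helper names, the normalized box format, and the placeholder
# vision_encoder are assumptions; the actual CoS pipeline may differ.
from PIL import Image
import torch


def crop_region_of_interest(image: Image.Image, box) -> Image.Image:
    """Crop a region of interest given a normalized (x0, y0, x1, y1) box.
    Cropping the original image keeps full pixel resolution inside the ROI."""
    w, h = image.size
    x0, y0, x1, y1 = box
    return image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))


def encode_with_focus(image: Image.Image, roi_box, vision_encoder) -> torch.Tensor:
    """Encode the global view and the ROI crop, then concatenate the token
    sequences so the LVLM sees both coarse context and fine detail."""
    global_tokens = vision_encoder(image)                                  # (n_global, d)
    roi_tokens = vision_encoder(crop_region_of_interest(image, roi_box))   # (n_roi, d)
    return torch.cat([global_tokens, roi_tokens], dim=0)
```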
This list is automatically generated from the titles and abstracts of the papers on this site.