LMEye: An Interactive Perception Network for Large Language Models
- URL: http://arxiv.org/abs/2305.03701v6
- Date: Thu, 28 Sep 2023 08:18:43 GMT
- Title: LMEye: An Interactive Perception Network for Large Language Models
- Authors: Yunxin Li, Baotian Hu, Xinyu Chen, Lin Ma, Yong Xu, and Min Zhang
- Abstract summary: LMEye is a human-like eye with a play-and-plug interactive perception network.
It enables dynamic interaction between Large Language Models and external vision information.
It significantly improves the zero-shot performance on various multimodal tasks.
- Score: 43.160353427015025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training a Multimodal Large Language Model (MLLM) from scratch, like GPT-4,
is resource-intensive. Regarding Large Language Models (LLMs) as the core
processor for multimodal information, our paper introduces LMEye, a human-like
eye with a play-and-plug interactive perception network, designed to enable
dynamic interaction between LLMs and external vision information. Previous
methods incorporate visual information into LLMs through a simple visual
mapping network or the Q-Former from BLIP-2. Such networks project image
features only once and do not model the interaction between the image and the
human input query. Hence, the visual information, obtained without reference
to human intention, may be inadequate for LLMs to generate intention-following
responses; we refer to this as static visual information. LMEye addresses this
issue by allowing the LLM to request the visual information it needs for a
given human instruction, which we term dynamic visual information interaction.
Specifically, LMEye consists of a simple visual mapping network that provides
the LLM with a basic perception of an image, together with additional modules
responsible for acquiring requests from the LLM, performing request-based
visual information interaction, and transmitting the resulting visual
information back to the LLM. In this way, the LLM understands the human query,
delivers the corresponding request to the request-based visual information
interaction module, and generates a response based on the interleaved
multimodal information. We evaluate LMEye through extensive experiments on
multiple multimodal benchmarks, demonstrating that it significantly improves
zero-shot performance on various multimodal tasks compared to previous
methods, with fewer parameters.
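The two pathways described above can be sketched in plain NumPy: a static visual mapping network that projects image features once, and a request-based interaction module in which a "request" vector from the LLM attends over the image features, so the returned visual information depends on the query. This is a minimal illustrative sketch, not the authors' implementation; all function names, weights, and dimensions here are assumptions.

```python
# Hedged sketch of LMEye's static vs. dynamic visual pathways (illustrative).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def static_mapping(image_feats, W_map):
    """Simple visual mapping network: a single linear projection of image
    features into the LLM embedding space (the static pathway)."""
    return image_feats @ W_map

def request_based_interaction(request, image_feats, W_q, W_k, W_v):
    """Cross-attention from the LLM's request vector over image features,
    so the interacted visual information follows the human instruction."""
    q = request @ W_q                        # (d,)   query from the request
    k = image_feats @ W_k                    # (n, d) keys from image patches
    v = image_feats @ W_v                    # (n, d) values from image patches
    attn = softmax(k @ q / np.sqrt(q.size))  # (n,)   relevance of each patch
    return attn @ v                          # (d,)   interacted visual info

rng = np.random.default_rng(0)
n, d_img, d = 5, 8, 4                        # patches, image dim, LLM dim
image_feats = rng.normal(size=(n, d_img))
W_map = rng.normal(size=(d_img, d))
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d_img, d))
W_v = rng.normal(size=(d_img, d))

static_info = static_mapping(image_feats, W_map)   # same for every query
request = rng.normal(size=d)                       # stand-in LLM request state
dynamic_info = request_based_interaction(request, image_feats, W_q, W_k, W_v)
```

The key contrast: `static_info` is fixed once the image is encoded, while `dynamic_info` changes with the request vector, mirroring the paper's static vs. dynamic distinction.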
Related papers
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z)
- From Image to Video, what do we need in multimodal LLMs? [19.85928004619801]
Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in understanding multimodal information.
We propose RED-VILLM, a Resource-Efficient Development pipeline for Video LLMs from Image LLMs.
Our approach highlights the potential for a more cost-effective and scalable advancement in multimodal models.
arXiv Detail & Related papers (2024-04-18T02:43:37Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short in comprehending contexts involving multiple images.
We propose a two phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z)
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs [34.211455081027964]
V* is a visual search mechanism that employs the world knowledge in LLMs for efficient visual querying.
Our study highlights the necessity of incorporating visual search capabilities into multimodal systems.
arXiv Detail & Related papers (2023-12-21T18:55:06Z)
- InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
Multimodal large language models (MLLMs) have attracted growing interest.
This work delves into enabling LLMs to tackle more vision-language-related tasks.
InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv Detail & Related papers (2023-11-12T09:58:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.