V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
- URL: http://arxiv.org/abs/2312.14135v2
- Date: Tue, 26 Dec 2023 15:20:45 GMT
- Title: V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
- Authors: Penghao Wu, Saining Xie
- Abstract summary: V* is a visual search mechanism that employs the world knowledge in LLMs for efficient visual querying.
Our study highlights the necessity of incorporating visual search capabilities into multimodal systems.
- Score: 34.211455081027964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When we look around and perform complex tasks, how we see and selectively
process what we see is crucial. However, the lack of this visual search
mechanism in current multimodal LLMs (MLLMs) hinders their ability to focus on
important visual details, especially when handling high-resolution and visually
crowded images. To address this, we introduce V*, an LLM-guided visual search
mechanism that employs the world knowledge in LLMs for efficient visual
querying. When combined with an MLLM, this mechanism enhances collaborative
reasoning, contextual understanding, and precise targeting of specific visual
elements. This integration results in a new MLLM meta-architecture, named Show,
sEArch, and TelL (SEAL). We further create V*Bench, a benchmark specifically
designed to evaluate MLLMs in their ability to process high-resolution images
and focus on visual details. Our study highlights the necessity of
incorporating visual search capabilities into multimodal systems. The code is
available at https://github.com/penghao-wu/vstar.
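The idea of a search guided by prior knowledge can be illustrated with a toy sketch. This is not the authors' implementation (see the repository above for that); it only shows a generic best-first search over image regions. Every name here (guided_search, guidance_score, is_found, min_size) is hypothetical, and the guidance function is a stand-in for whatever cue an LLM might supply about where a target is likely to be.

```python
import heapq
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), in pixels


def split_into_quadrants(box: Box) -> List[Box]:
    """Split a box into four sub-boxes of (roughly) equal size."""
    left, top, right, bottom = box
    mid_x = (left + right) // 2
    mid_y = (top + bottom) // 2
    return [
        (left, top, mid_x, mid_y),
        (mid_x, top, right, mid_y),
        (left, mid_y, mid_x, bottom),
        (mid_x, mid_y, right, bottom),
    ]


def guided_search(
    image_box: Box,
    is_found: Callable[[Box], bool],          # hypothetical: "target located in this crop"
    guidance_score: Callable[[Box], float],   # hypothetical: higher = more promising region
    min_size: int = 64,
) -> Optional[Box]:
    """Best-first search: always expand the most promising region next."""
    counter = 0  # tie-breaker so the heap never has to compare boxes
    frontier = [(-guidance_score(image_box), counter, image_box)]
    while frontier:
        _, _, box = heapq.heappop(frontier)
        if is_found(box):
            return box
        left, top, right, bottom = box
        if right - left <= min_size or bottom - top <= min_size:
            continue  # too small to subdivide further
        for sub_box in split_into_quadrants(box):
            counter += 1
            heapq.heappush(frontier, (-guidance_score(sub_box), counter, sub_box))
    return None


# Toy setup: a small target at a known point in a 1024x1024 image.
TARGET = (900, 700)


def contains_target(box: Box) -> bool:
    left, top, right, bottom = box
    return left <= TARGET[0] < right and top <= TARGET[1] < bottom


def toy_is_found(box: Box) -> bool:
    # Assumption: the target only becomes recognizable in crops <= 128 px wide.
    return contains_target(box) and (box[2] - box[0]) <= 128


def toy_guidance(box: Box) -> float:
    # Stand-in for a learned or LLM-derived prior: closer box centers score higher.
    center_x = (box[0] + box[2]) / 2
    center_y = (box[1] + box[3]) / 2
    return -(((center_x - TARGET[0]) ** 2 + (center_y - TARGET[1]) ** 2) ** 0.5)


result = guided_search((0, 0, 1024, 1024), toy_is_found, toy_guidance)
print(result)  # (896, 640, 1024, 768)
```

In this toy setup the search reaches a 128-pixel crop around the target after expanding only three regions, which is the kind of saving a good guidance signal is meant to provide on large, crowded images.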
Related papers
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
(2024-10-16T07:52:57Z)
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders [89.38717274524681]
This study explores the design space for multimodal large language models (MLLMs) using a mixture of vision encoders and resolutions.
Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach.
The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
(2024-08-28T17:59:31Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
(2024-07-05T17:43:30Z)
- Tell Me Where You Are: Multimodal LLMs Meet Place Recognition [11.421492098416538]
We introduce multimodal large language models (MLLMs) to visual place recognition (VPR).
Our key design is to use vision-based retrieval to propose several candidates and then leverage language-based reasoning to carefully inspect each candidate for a final decision.
Our results on three datasets demonstrate that integrating the general-purpose visual features from vision foundation models (VFMs) with the reasoning capabilities of MLLMs already provides an effective place recognition solution.
(2024-06-25T12:59:46Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
(2024-06-24T17:59:42Z)
- VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large model (MLLM).
It unifies visual perception, understanding, and generation within a single framework.
(2024-06-12T16:44:50Z)
- Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks [0.8192907805418583]
We develop a method to transform domain-specific visual and vision-language datasets into a unified question answering format called Visual Question Answering Instruction (VQA-IN).
The proposed method achieved high scores on domain-specific visual tasks while maintaining its performance on vision-language tasks in a multitask manner.
(2024-02-13T10:40:53Z)
- Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [50.77984109941538]
Our research reveals that the visual capabilities in recent multimodal LLMs still exhibit systematic shortcomings.
We identify "CLIP-blind pairs": images that CLIP perceives as similar despite their clear visual differences.
We evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs.
(2024-01-11T18:58:36Z)
- From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models [36.41816380074965]
We investigate the effectiveness of different vision encoders within Multi-modal Large Language Models (MLLMs).
Our findings reveal that the shallow layer features of CLIP offer particular advantages for fine-grained tasks such as grounding and region understanding.
We propose a simple yet effective feature merging strategy, named COMM, that integrates CLIP and DINO with Multi-level features Merging.
(2023-10-13T02:41:55Z)