MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
- URL: http://arxiv.org/abs/2502.17422v1
- Date: Mon, 24 Feb 2025 18:54:40 GMT
- Title: MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
- Authors: Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
- Abstract summary: We study whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images. We propose training-free visual intervention methods that leverage the internal knowledge of any MLLM itself.
- Score: 11.532430076027554
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have experienced rapid progress in visual recognition tasks in recent years. Given their potential integration into many critical applications, it is important to understand the limitations of their visual perception. In this work, we study whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images. We observe that their performance is very sensitive to the size of the visual subject of the question, and further show that this effect is in fact causal by conducting an intervention study. Next, we study the attention patterns of MLLMs when answering visual questions, and intriguingly find that they consistently know where to look, even when they provide the wrong answer. Based on these findings, we then propose training-free visual intervention methods that leverage the internal knowledge of any MLLM itself, in the form of attention and gradient maps, to enhance its perception of small visual details. We evaluate our proposed methods on two widely-used MLLMs and seven visual question answering benchmarks and show that they can significantly improve MLLMs' accuracy without requiring any training. Our results elucidate the risk of applying MLLMs to visual recognition tasks concerning small details and indicate that visual intervention using the model's internal state is a promising direction to mitigate this risk.
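To make the proposed intervention concrete, below is a minimal sketch of one plausible step: pooling the answer tokens' attention over image tokens into a 2D relevance map and reading off a box around its peak. The grid size, pooling, and threshold are illustrative assumptions rather than the paper's exact recipe, and a normalized gradient map could be substituted for the attention map in the same way.

```python
# Hypothetical sketch: attention weights -> 2D relevance map -> peak box.
import numpy as np

def relevance_map(attn, grid=(24, 24)):
    """attn: (num_answer_tokens, num_image_tokens) attention weights."""
    m = attn.mean(axis=0).reshape(grid)                # pool over answer tokens
    return (m - m.min()) / (m.max() - m.min() + 1e-8)  # normalize to [0, 1]

def peak_bbox(m, frac=0.5):
    """Box (x0, y0, x1, y1), in grid cells, covering cells >= frac * max."""
    ys, xs = np.where(m >= frac * m.max())
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

# Toy demo: synthetic attention that spikes at grid cell (row 10, col 12).
rng = np.random.default_rng(0)
attn = 0.1 * rng.random((5, 24 * 24))
attn[:, 24 * 10 + 12] = 1.0
print(peak_bbox(relevance_map(attn)))                  # (12, 10, 13, 11)
```

A crop taken around such a box can then be handed back to the same model, which is the essence of a training-free intervention.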
Related papers
- Evaluating Graphical Perception with Multimodal LLMs [2.090547583226381]
Multimodal Large Language Models (MLLMs) have made remarkable progress in analyzing and understanding images.
For visualization, how do MLLMs perform when applied to graphical perception tasks?
Our study primarily evaluates pretrained and fine-tuned models, as well as zero-shot prompting, to determine how closely they match human graphical perception.
arXiv Detail & Related papers (2025-04-05T16:14:08Z)
- Grounded Chain-of-Thought for Multimodal Large Language Models [66.04061083611863]
We propose a new learning task for multimodal large language models (MLLMs) called Grounded Chain-of-Thought (GCoT).
GCoT aims to help MLLMs recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as an intuitive basis.
To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT) consisting of 24,022 GCoT examples for 5,033 images.
arXiv Detail & Related papers (2025-03-17T04:07:47Z)
- Explore the Hallucination on Low-level Perception for MLLMs [83.12180878559295]
We aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks.
We present QL-Bench, a benchmark setting that simulates human responses to low-level vision.
We demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped.
arXiv Detail & Related papers (2024-09-15T14:38:29Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- Exploring Perceptual Limitation of Multimodal Large Language Models [57.567868157293994]
We quantitatively study the perception of small visual objects in several state-of-the-art MLLMs.
We identify four independent factors that can contribute to this limitation.
Lower object quality and smaller object size can both independently reduce MLLMs' ability to answer visual questions.
arXiv Detail & Related papers (2024-02-12T03:04:42Z)
- Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [50.77984109941538]
Our research reveals that the visual capabilities in recent multimodal LLMs still exhibit systematic shortcomings.
We identify "CLIP-blind pairs": images that CLIP perceives as similar despite their clear visual differences.
We evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs.
arXiv Detail & Related papers (2024-01-11T18:58:36Z)
- Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLM training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z)
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs [34.211455081027964]
V* is a visual search mechanism that employs the world knowledge in LLMs for efficient visual querying.
Our study highlights the necessity of incorporating visual search capabilities into multimodal systems; a toy illustration of guided search follows this entry.
arXiv Detail & Related papers (2023-12-21T18:55:06Z)
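As a toy illustration of guided visual search, the hypothetical loop below repeatedly zooms into whichever quadrant a scoring function prefers. In V* the guidance comes from the LLM's world knowledge; the brightness-based scorer here is only a stand-in.

```python
# Toy, hypothetical guided search: recursively crop toward the quadrant
# a scoring oracle prefers. Not V*'s actual algorithm.
from PIL import Image, ImageStat

def guided_search(img, score, depth=3):
    for _ in range(depth):
        W, H = img.size
        quads = [(0, 0, W // 2, H // 2), (W // 2, 0, W, H // 2),
                 (0, H // 2, W // 2, H), (W // 2, H // 2, W, H)]
        img = img.crop(max(quads, key=lambda b: score(img, b)))
    return img

def brightness(img, box):
    """Stand-in scorer: mean luminance. A real system would instead ask the
    MLLM which region most likely contains the queried object."""
    return ImageStat.Stat(img.crop(box).convert("L")).mean[0]

# patch = guided_search(Image.open("scene.jpg"), brightness)
```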
- Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs [12.598351373932234]
We investigate whether MLLMs can perceive small details as well as large details in images.
We show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question.
We propose five automatic visual cropping methods to improve the zero-shot performance of MLLMs; a generic crop-and-zoom step is sketched after this entry.
arXiv Detail & Related papers (2023-10-24T17:48:04Z)
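Since the entry above centers on automatic visual cropping, here is a hedged sketch of a generic crop-and-zoom step: map a grid-cell box (for instance, one obtained from an attention map) to pixel coordinates, pad it for context, crop, and upscale before re-querying the model. The margin and box-to-pixel mapping are illustrative assumptions, not one of the paper's five methods.

```python
# Hypothetical crop-and-zoom helper; parameters are illustrative.
from PIL import Image

def crop_and_zoom(img, grid_box, grid=(24, 24), margin=0.5):
    """Map a grid-cell box to pixels, pad by `margin`, crop, and upscale."""
    (gw, gh), (W, H) = grid, img.size
    x0, y0, x1, y1 = grid_box
    px0, py0, px1, py1 = x0 / gw * W, y0 / gh * H, x1 / gw * W, y1 / gh * H
    mx, my = (px1 - px0) * margin, (py1 - py0) * margin   # context padding
    box = (max(0, px0 - mx), max(0, py0 - my),
           min(W, px1 + mx), min(H, py1 + my))
    return img.crop(box).resize((W, H), Image.Resampling.BICUBIC)

# zoomed = crop_and_zoom(Image.open("photo.jpg"), (12, 10, 13, 11))
# The MLLM is then asked the same question about `zoomed`.
```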
- From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models [36.41816380074965]
We investigate the effectiveness of different vision encoders within Multimodal Large Language Models (MLLMs).
Our findings reveal that the shallow layer features of CLIP offer particular advantages for fine-grained tasks such as grounding and region understanding.
We propose a simple yet effective feature merging strategy, named COMM, that integrates CLIP and DINO with Multi-level features Merging; a rough sketch of such merging follows this entry.
arXiv Detail & Related papers (2023-10-13T02:41:55Z)
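As a rough sketch of what multi-level feature merging could look like, the toy module below combines several CLIP layer outputs with learned per-level weights and concatenates a DINO feature before projecting to the LLM width. Dimensions, layer choices, and the fusion scheme are assumptions for illustration, not COMM's actual architecture.

```python
# Illustrative multi-level feature merging with placeholder tensors.
import torch
import torch.nn as nn

class FeatureMerger(nn.Module):
    def __init__(self, clip_dim=1024, dino_dim=768, out_dim=4096, n_levels=4):
        super().__init__()
        self.level_w = nn.Parameter(torch.zeros(n_levels))   # per-level weights
        self.proj = nn.Linear(clip_dim + dino_dim, out_dim)  # to LLM width

    def forward(self, clip_levels, dino_feat):
        # clip_levels: list of (B, N, clip_dim) features from several CLIP layers
        w = torch.softmax(self.level_w, dim=0)
        clip_merged = sum(wi * f for wi, f in zip(w, clip_levels))
        return self.proj(torch.cat([clip_merged, dino_feat], dim=-1))

# Toy usage with random tensors standing in for real encoder outputs.
B, N = 2, 576
tokens = FeatureMerger()([torch.randn(B, N, 1024) for _ in range(4)],
                         torch.randn(B, N, 768))
print(tokens.shape)  # torch.Size([2, 576, 4096])
```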
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.