FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering
- URL: http://arxiv.org/abs/2506.21710v1
- Date: Thu, 26 Jun 2025 18:51:04 GMT
- Title: FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering
- Authors: Liangyu Zhong, Fabio Rosenthal, Joachim Sicking, Fabian Hüger, Thorsten Bagdonat, Hanno Gottschalk, Leo Schwinn
- Abstract summary: We propose a training-free visual cropping method, dubbed FOCUS, to guide the search for the most relevant image region. FOCUS achieves strong performance across four fine-grained VQA datasets and two types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye.
- Score: 5.840924060437216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) focusing on small image details remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: the need for task-specific fine-tuning, low efficiency due to uninformed exhaustive search, or incompatibility with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task using the top-ranked region. As a result of this informed search strategy, FOCUS achieves strong performance across four fine-grained VQA datasets and two types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye, while requiring 3-6.5x less compute.
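For intuition, the sketch below illustrates only steps three and four of the pipeline: given a precomputed object relevance map (the paper derives it from the KV cache; that computation is not reproduced here), it proposes candidate crops on a sliding grid, ranks them by accumulated relevance, and runs the VQA call on the top-ranked crop. This is a minimal sketch under those assumptions; all function names, parameters, and the grid strategy are hypothetical, not the authors' implementation.

```python
import numpy as np
from typing import Callable, List, Tuple

def propose_and_rank_crops(
    relevance: np.ndarray,      # (H, W) object relevance map, e.g. derived from KV-cache statistics
    crop_frac: float = 0.5,     # candidate crop size as a fraction of the image side (assumed value)
    stride_frac: float = 0.25,  # sliding stride as a fraction of the crop size (assumed value)
) -> List[Tuple[float, Tuple[int, int, int, int]]]:
    """Slide a window over the relevance map and rank crops by summed relevance."""
    H, W = relevance.shape
    ch, cw = max(1, int(H * crop_frac)), max(1, int(W * crop_frac))
    sh, sw = max(1, int(ch * stride_frac)), max(1, int(cw * stride_frac))
    ranked = []
    for top in range(0, H - ch + 1, sh):
        for left in range(0, W - cw + 1, sw):
            score = float(relevance[top:top + ch, left:left + cw].sum())
            ranked.append((score, (top, left, top + ch, left + cw)))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked

def answer_with_top_crop(
    image: np.ndarray,                          # (H, W, C) input image
    relevance: np.ndarray,                      # (H, W) relevance map, same spatial size as image
    vqa_fn: Callable[[np.ndarray, str], str],   # stand-in for the MLLM's fine-grained VQA call
    question: str,
) -> str:
    """Run the fine-grained VQA step on the highest-ranked region."""
    _, (t, l, b, r) = propose_and_rank_crops(relevance)[0]
    return vqa_fn(image[t:b, l:r], question)
```

A multi-scale variant would call `propose_and_rank_crops` with several `crop_frac` values and merge the ranked lists before selecting the final region.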
Related papers
- BlindSight: Harnessing Sparsity for Efficient VLMs [4.756688231351083]
We propose BlindSight: a training-free approach to optimize VLM inference using an input template-aware attention sparsity mask. BlindSight results in a 32%-41% reduction in FLOPs on average with a -2% to +2% change in accuracy compared to the original model in most evaluated multi-image understanding benchmarks.
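As a rough illustration of the general idea (not BlindSight's actual mask derivation), the sketch below applies a template-derived boolean mask to attention scores so that disallowed query-key pairs receive zero weight. The mask construction and all names are assumptions.

```python
import numpy as np

def template_mask(segment_ids: np.ndarray) -> np.ndarray:
    """Hypothetical template-aware mask: tokens attend within their own segment
    and to segment 0 (e.g. the text prompt). segment_ids: (seq,) integer labels."""
    same = segment_ids[:, None] == segment_ids[None, :]
    to_text = segment_ids[None, :] == 0
    return same | to_text

def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention with a boolean sparsity mask.

    q, k, v: (seq, dim); mask: (seq, seq), True where attention is allowed.
    Assumes every query has at least one allowed key (true for template_mask,
    since each token always attends within its own segment)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```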
arXiv Detail & Related papers (2025-07-11T23:15:30Z) - 4th PVUW MeViS 3rd Place Report: Sa2VA [105.88675577642204]
We show that with a simple modification to the test-time inference method on stronger MLLMs, we can obtain stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos.
arXiv Detail & Related papers (2025-04-01T07:06:47Z) - Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection [53.558449071113245]
Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). Recent advancements in vision-language modeling introduce image cropping techniques that feed all encoded sub-images into the model. We propose a lightweight, universal framework that seamlessly integrates with existing VLMs to enhance their ability to process fine-grained details.
arXiv Detail & Related papers (2025-03-14T18:33:31Z) - VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - Probing Visual Language Priors in VLMs [51.016683265437536]
We introduce ViLP, a benchmark featuring deliberately out-of-distribution images. Each question in ViLP is coupled with three potential answers and three corresponding images. We propose a self-improving framework in which models generate new VQA data, then apply pixel-level and semantic corruptions to form "good-bad" image pairs for self-training.
arXiv Detail & Related papers (2024-12-31T17:54:29Z) - Distill Visual Chart Reasoning Ability from LLMs to MLLMs [38.62832112530892]
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs).
We propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs.
We employ text-based synthesizing techniques to construct chart-plotting code and produce ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs.
arXiv Detail & Related papers (2024-10-24T14:50:42Z) - Attention Prompting on Image for Large Vision-Language Models [63.794304207664176]
We propose a new prompting technique named Attention Prompting on Image.
We generate an attention heatmap for the input image, conditioned on the text query, using an auxiliary model such as CLIP.
Experiments on various vision-language benchmarks verify the effectiveness of our technique.
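A minimal sketch of such a query-conditioned heatmap built with an auxiliary CLIP model is shown below: the image is split into a grid of tiles, each tile is scored against the text query, and the scores form a coarse heatmap. The tiling strategy, checkpoint choice, and function name are assumptions, not the paper's exact overlay procedure.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def clip_grid_heatmap(image: Image.Image, query: str, grid: int = 6) -> np.ndarray:
    """Score each tile of an image against a text query and return a coarse heatmap."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    W, H = image.size
    tiles = [
        image.crop((c * W // grid, r * H // grid, (c + 1) * W // grid, (r + 1) * H // grid))
        for r in range(grid) for c in range(grid)
    ]
    inputs = processor(text=[query], images=tiles, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image.squeeze(-1)  # (grid*grid,) image-text similarity
    return sims.numpy().reshape(grid, grid)
```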
arXiv Detail & Related papers (2024-09-25T17:59:13Z) - VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z) - Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs [12.598351373932234]
We investigate whether MLLMs can perceive small details as well as large details in images.
We show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question.
We propose five automatic visual cropping methods to improve the zero-shot performance of MLLMs.
arXiv Detail & Related papers (2023-10-24T17:48:04Z) - Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z) - LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z)