A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images
- URL: http://arxiv.org/abs/2507.10202v1
- Date: Mon, 14 Jul 2025 12:14:53 GMT
- Title: A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images
- Authors: Jaeseong Lee, Yeeun Choi, Heechan Choi, Hanjung Kim, Seonjoo Kim
- Abstract summary: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding, reasoning, and generation. They struggle with tasks requiring fine-grained localization and reasoning in high-resolution images. We propose Extract Candidate then Predict (ECP), a novel training-free, task-agnostic two-stage framework designed to enhance MLLM performance on high-resolution images.
- Score: 19.549498712690404
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding, reasoning, and generation. However, they struggle with tasks requiring fine-grained localization and reasoning in high-resolution images. This constraint stems from the fact that MLLMs are fine-tuned at a fixed image resolution to align with the pre-trained image encoder used in the MLLM. Consequently, feeding high-resolution images directly into MLLMs leads to poor generalization due to a train-test resolution discrepancy, while downsampling these images, although it ensures consistency, compromises fine-grained visual details and ultimately degrades performance. To address this challenge, we propose Extract Candidate then Predict (ECP), a novel training-free, task-agnostic two-stage framework designed to enhance MLLM performance on high-resolution images. The key intuition behind ECP is that while MLLMs struggle with high-resolution images, their predictions on downsampled images still contain implicit localization cues. By first identifying a candidate region using the coarse prediction and then predicting the final output based on that candidate region, ECP effectively preserves fine-grained details while mitigating the challenges posed by high-resolution data. We validate our framework on 4K GUI grounding and on 4K and 8K MLLM perception, achieving +21.3%, +5.8%, and +5.2% absolute improvements over the baseline, respectively, demonstrating its effectiveness. Code is available at https://github.com/yenncye/ECP.
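The two-stage procedure is easy to sketch. Below is a minimal illustration of the ECP idea for point-style outputs such as GUI grounding, assuming a hypothetical `mllm_predict(image, prompt)` callable that returns pixel coordinates on whatever image it receives; the paper's actual prompting, answer parsing, and candidate-region sizing may differ.

```python
# Minimal sketch of Extract Candidate then Predict (ECP).
# Assumptions (not from the paper): mllm_predict(image, prompt) -> (x, y)
# in pixels of the given image; MLLM_RES and CANDIDATE_SIZE are illustrative.
from PIL import Image

MLLM_RES = 1024        # resolution the MLLM is assumed to be aligned to
CANDIDATE_SIZE = 1024  # side length of the extracted candidate crop

def ecp_point(image_path: str, prompt: str, mllm_predict):
    hr = Image.open(image_path)
    W, H = hr.size

    # Stage 1 (Extract Candidate): query the MLLM on a downsampled copy;
    # even a coarse answer carries an implicit localization cue.
    lr = hr.resize((MLLM_RES, MLLM_RES))
    x, y = mllm_predict(lr, prompt)
    cx, cy = x * W / MLLM_RES, y * H / MLLM_RES  # back to full-res coordinates

    # Extract a candidate crop around the coarse point, clamped to the image.
    half = CANDIDATE_SIZE // 2
    left = min(max(int(cx) - half, 0), max(W - CANDIDATE_SIZE, 0))
    top = min(max(int(cy) - half, 0), max(H - CANDIDATE_SIZE, 0))
    crop = hr.crop((left, top, left + CANDIDATE_SIZE, top + CANDIDATE_SIZE))

    # Stage 2 (Predict): query the MLLM on the detail-preserving crop and
    # translate its answer back into full-image coordinates.
    px, py = mllm_predict(crop, prompt)
    return left + px, top + py
```

Because the second query sees a native-resolution crop, fine detail survives while each MLLM input stays near the resolution the model was aligned to.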
Related papers
- Demystifying the Visual Quality Paradox in Multimodal Large Language Models [49.154146792279946]
Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses. We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks. We uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity.
arXiv Detail & Related papers (2025-06-18T17:14:07Z)
- MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval [50.062817677022586]
Zero-Shot Composed Image Retrieval (ZS-CIR) methods typically train adapters that convert reference images into pseudo-text tokens. We propose MLLM-Guided VLM Fine-Tuning with Joint Inference (MVFT-JI) to construct two complementary training tasks using only unlabeled images.
arXiv Detail & Related papers (2025-05-26T08:56:59Z)
- Dynamic Pyramid Network for Efficient Multimodal Large Language Model [11.864416286283399]
Multimodal large language models (MLLMs) have demonstrated impressive performance in various vision-language (VL) tasks. Recent efforts aim to compress the visual features to reduce the computational cost of MLLMs. We propose a novel dynamic pyramid network (DPN) for efficient MLLMs.
arXiv Detail & Related papers (2025-03-26T08:44:11Z)
- Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage [50.84150600032693]
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. We propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions (a rough sketch of such a correction loop follows this entry). Our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V.
arXiv Detail & Related papers (2024-12-20T01:37:22Z)
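As a hedged illustration of LLM-MLLM collaboration of this kind, the toy sketch below decomposes a caption into claims, verifies each against the image, and rewrites the survivors; the prompts, the `llm`/`mllm` callables, and the split-verify-rewrite protocol are all assumptions, not the paper's exact agents.

```python
# Hypothetical sketch of an LLM-MLLM caption-correction loop; the prompts
# and protocol are assumptions, not the paper's exact multiagent design.
def correct_caption(image, caption, llm, mllm):
    # Decompose the caption into atomic claims, one per line.
    claims = llm("Split this caption into one factual claim per line:\n"
                 + caption).splitlines()
    # Have the MLLM verify each claim against the image; keep supported ones.
    kept = [c for c in claims if c.strip() and
            mllm(image, "Answer yes or no: is this claim true of the image? "
                 + c).strip().lower().startswith("yes")]
    # Rewrite the surviving claims as one fluent, corrected caption.
    return llm("Rewrite these verified claims as a single fluent caption:\n"
               + "\n".join(kept))
```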
- PETALface: Parameter Efficient Transfer Learning for Low-resolution Face Recognition [54.642714288448744]
PETALface is the first work leveraging the power of PEFT for low-resolution face recognition. We introduce two low-rank adaptation modules to the backbone, with weights adjusted based on the input image quality to account for the difference in quality between the gallery and probe images (a toy sketch of this gating follows this entry). Experiments demonstrate that the proposed method outperforms full fine-tuning on low-resolution datasets while preserving performance on high-resolution and mixed-quality datasets.
arXiv Detail & Related papers (2024-12-10T18:59:45Z)
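A minimal sketch of quality-gated low-rank adaptation in the spirit of PETALface is shown below; the `quality` score, the linear gate, and the two-adapter blend are illustrative assumptions rather than the paper's exact design.

```python
# Illustrative quality-gated LoRA blend (assumed formulation, not PETALface's
# exact one): two low-rank adapters mixed by an image-quality score in [0, 1].
import numpy as np

def quality_gated_lora(x, W, lora_hq, lora_lq, quality):
    """x: (d_in,) input; W: (d_in, d_out) frozen weight; quality in [0, 1]."""
    A_hq, B_hq = lora_hq  # (d_in, r), (r, d_out): adapter for high-quality inputs
    A_lq, B_lq = lora_lq  # (d_in, r), (r, d_out): adapter for low-quality inputs
    # Blend the two low-rank updates with the quality gate, then apply.
    delta = quality * (x @ A_hq @ B_hq) + (1.0 - quality) * (x @ A_lq @ B_lq)
    return x @ W + delta
```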
- Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models [57.280853324896306]
Multimodal large language models (MLLMs) struggle to recognize and interpret intricate details in high-resolution (HR) images effectively.
We introduce HR-Bench, the first deliberately designed benchmark to rigorously evaluate MLLM performance on 4K & 8K images.
We propose Divide, Conquer and Combine (DC$^2$), a novel training-free framework for enhancing MLLM perception of HR images (a toy sketch follows this entry).
arXiv Detail & Related papers (2024-08-28T06:09:02Z)
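The divide-conquer-combine idea can be sketched as follows, again assuming a hypothetical `mllm(image, prompt)` callable that returns text; the fixed 2x2 grid and prompts are illustrative, while DC$^2$'s actual recursive cropping and patch filtering are more involved.

```python
# Rough sketch of a divide-conquer-combine pass over a high-resolution image.
from PIL import Image

def dc2_answer(image_path: str, question: str, mllm):
    hr = Image.open(image_path)
    W, H = hr.size
    notes = []
    # Divide: a fixed 2x2 grid of near-native-resolution crops (illustrative).
    for i in range(2):
        for j in range(2):
            crop = hr.crop((j * W // 2, i * H // 2,
                            (j + 1) * W // 2, (i + 1) * H // 2))
            # Conquer: describe each local crop in detail.
            notes.append(mllm(crop, "Describe this region in detail."))
    # Combine: answer from a global (downsampled) view plus the local notes.
    prompt = "Region notes: " + " ".join(notes) + "\nQuestion: " + question
    return mllm(hr.resize((1024, 1024)), prompt)
```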
- Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance [51.30560006045442]
Image-gRounded guIdaNcE (MARINE) is a framework that is both training-free and API-free. MARINE effectively and efficiently reduces object hallucinations during inference by introducing image-grounded guidance to LVLMs (a minimal sketch follows this entry). Our framework's flexibility further allows for the integration of multiple vision models, enabling more reliable and robust object-level guidance.
arXiv Detail & Related papers (2024-02-13T18:59:05Z)
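For intuition, image-grounded guidance of this kind can be written as a classifier-free-guidance-style adjustment of next-token logits; the function below is a hedged sketch, and `guidance_strength` plus both logit inputs are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative logit-level guidance: amplify the shift that image-grounded
# evidence induces in the LVLM's next-token distribution. A strength of 0
# recovers the unguided model. Assumed formulation, not MARINE's exact one.
import numpy as np

def guided_logits(base_logits: np.ndarray,
                  grounded_logits: np.ndarray,
                  guidance_strength: float = 1.0) -> np.ndarray:
    return base_logits + guidance_strength * (grounded_logits - base_logits)
```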