Zoom-Refine: Boosting High-Resolution Multimodal Understanding via Localized Zoom and Self-Refinement
- URL: http://arxiv.org/abs/2506.01663v2
- Date: Mon, 11 Aug 2025 07:25:46 GMT
- Title: Zoom-Refine: Boosting High-Resolution Multimodal Understanding via Localized Zoom and Self-Refinement
- Authors: Xuan Yu, Dayan Guan, Yanfeng Gu
- Abstract summary: Multimodal Large Language Models (MLLM) often struggle to interpret high-resolution images accurately. We introduce Zoom-Refine, a novel training-free method that enhances MLLM capabilities to address this issue. Our method harnesses the MLLM's inherent capabilities for spatial localization, contextual reasoning and comparative analysis without requiring additional training or external experts.
- Score: 17.824841346088903
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multimodal Large Language Models (MLLM) often struggle to interpret high-resolution images accurately, where fine-grained details are crucial for complex visual understanding. We introduce Zoom-Refine, a novel training-free method that enhances MLLM capabilities to address this issue. Zoom-Refine operates through a synergistic process of Localized Zoom and Self-Refinement. In the Localized Zoom step, Zoom-Refine leverages the MLLM to provide a preliminary response to an input query and identifies the most task-relevant image region by predicting its bounding box coordinates. During the Self-Refinement step, Zoom-Refine then integrates fine-grained details from the high-resolution crop (identified by Localized Zoom) with its initial reasoning to re-evaluate and refine its preliminary response. Our method harnesses the MLLM's inherent capabilities for spatial localization, contextual reasoning and comparative analysis without requiring additional training or external experts. Comprehensive experiments demonstrate the efficacy of Zoom-Refine on two challenging high-resolution multimodal benchmarks. Code is available at https://github.com/xavier-yu114/Zoom-Refine
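The two-step procedure above lends itself to a compact sketch. The snippet below is a minimal, illustrative rendering of the described pipeline, not the authors' released code: the `mllm.generate(images, prompt)` interface, the prompt wording, and the `parse_box` helper are all assumptions.

```python
# Minimal sketch of the Zoom-Refine pipeline (Localized Zoom + Self-Refinement).
# `mllm` is a hypothetical model wrapper exposing generate(images, prompt) -> str;
# the prompts and the box parser below are illustrative assumptions.
import re
from PIL import Image

def parse_box(text, size):
    """Take the first four numbers in the response as [x1, y1, x2, y2], clamped to the image."""
    x1, y1, x2, y2 = [float(n) for n in re.findall(r"-?\d+\.?\d*", text)[:4]]
    w, h = size
    return (max(0, int(x1)), max(0, int(y1)), min(w, int(x2)), min(h, int(y2)))

def zoom_refine(mllm, image: Image.Image, question: str) -> str:
    # Localized Zoom: preliminary answer plus the most task-relevant region.
    prelim = mllm.generate([image], f"{question}\nGive a brief preliminary answer.")
    box_text = mllm.generate(
        [image],
        f"Question: {question}\nReturn the bounding box [x1, y1, x2, y2] of the image "
        "region most relevant to answering the question.",
    )
    crop = image.crop(parse_box(box_text, image.size))

    # Self-Refinement: re-evaluate the preliminary answer using the high-resolution crop.
    return mllm.generate(
        [image, crop],
        f"Question: {question}\nPreliminary answer: {prelim}\n"
        "The second image is a zoomed-in crop of the most relevant region. "
        "Re-examine its fine-grained details and give a final, refined answer.",
    )
```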
Related papers
- GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery [69.05066425853326]
"thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools.<n>This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny.<n>We propose GeoEyes, a training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom
arXiv Detail & Related papers (2026-02-15T15:50:55Z) - Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception [43.08943307183693]
Region-to-Image Distillation transforms zooming from an inference-time tool into a training-time primitive. We show that our models achieve leading performance across multiple fine-grained perception benchmarks.
arXiv Detail & Related papers (2026-02-12T12:00:35Z) - GRASP: Guided Region-Aware Sparse Prompting for Adapting MLLMs to Remote Sensing [50.961694646995376]
We propose a parameter-efficient fine-tuning (PEFT) strategy called Guided Region-Aware Sparse Prompting (GRASP). GRASP introduces spatially structured soft prompts associated with spatial blocks extracted from a frozen visual token grid. Experiments on multiple RSVQA benchmarks show that GRASP achieves competitive performance compared to existing fine-tuning and prompt-based methods.
arXiv Detail & Related papers (2026-01-23T10:12:59Z) - ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks [49.99788276124186]
Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm. We present LRS-GRO, a large-scale benchmark dataset tailored for active perception in UHR RS processing. We propose ZoomEarth, an adaptive cropping-zooming framework with a novel Region-Guided reward that provides fine-grained guidance.
arXiv Detail & Related papers (2025-11-15T15:47:46Z) - FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning [62.11389260206383]
FineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects. We present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation of subtle, small-scale targets.
arXiv Detail & Related papers (2025-10-24T10:14:17Z) - Spatial Preference Rewarding for MLLMs Spatial Understanding [92.25703021388142]
Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities. We propose a Spatial Preference Rewarding (SPR) approach that enhances MLLMs' spatial capabilities.
arXiv Detail & Related papers (2025-10-16T07:16:18Z) - HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling [22.105148012267005]
HiDe is a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens. It reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. After optimization, HiDe uses 75% less memory than the previous training-free approach.
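A rough sketch of the kind of attention-based token selection that Token-wise Attention Decoupling suggests is shown below; the shapes, the averaging over question tokens, and the keep ratio are assumptions rather than the paper's exact formulation.

```python
# Illustrative attention-based selection of question-relevant visual tokens
# (in the spirit of Token-wise Attention Decoupling; details are assumed).
import torch

def select_key_visual_tokens(q_tok: torch.Tensor, v_tok: torch.Tensor, keep_ratio: float = 0.25):
    """q_tok: (Nq, d) question-token features; v_tok: (Nv, d) visual tokens.
    Returns the visual tokens most attended to by the question, in their original order."""
    scores = torch.softmax(q_tok @ v_tok.T / v_tok.shape[-1] ** 0.5, dim=-1)  # (Nq, Nv)
    relevance = scores.mean(dim=0)                       # aggregate attention over question tokens
    k = max(1, int(keep_ratio * v_tok.shape[0]))
    keep = relevance.topk(k).indices.sort().values       # keep top-k tokens, preserve spatial layout
    return v_tok[keep], keep
```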
arXiv Detail & Related papers (2025-09-28T08:31:48Z) - Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment [51.99765487172328]
Chain-of-Zoom (CoZ) is a framework that factorizes single-image super-resolution (SISR) into a chain of intermediate scale-states with multi-scale-aware prompts. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity.
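The scale-autoregressive loop summarized above can be pictured in a few lines; `sr_model` and `vlm_describe` are placeholders for the diffusion SR model and the prompt-generating VLM, and the fixed four-step schedule is only an assumption matching a 4x-per-step, roughly 256x-overall chain.

```python
# Schematic of the Chain-of-Zoom loop: repeated 4x super-resolution, each step
# guided by a VLM-generated text prompt. `sr_model` and `vlm_describe` are
# placeholders, not the paper's actual components.
def chain_of_zoom(image, sr_model, vlm_describe, steps: int = 4):
    current = image
    for _ in range(steps):                          # four 4x steps give a nominal 256x chain
        prompt = vlm_describe(current)              # scale-aware text cue for the current state
        current = sr_model(current, prompt=prompt)  # one 4x super-resolution step
    return current
```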
arXiv Detail & Related papers (2025-05-24T08:50:08Z) - VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought [51.43082554363725]
We introduce VLM-R$^3$ (Visual Language Model with Region Recognition, Reasoning, and Refinement), a framework that equips an MLLM with the ability to decide when additional visual evidence is needed. Experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R$^3$ sets a new state of the art.
arXiv Detail & Related papers (2025-05-22T03:50:13Z) - XeMap: Contextual Referring in Large-Scale Remote Sensing Environments [13.162347922111056]
The XeMap task focuses on contextual, fine-grained localization of text-referred regions in large-scale RS scenes. XeMap-Network handles the complexities of pixel-level cross-modal contextual referring mapping in RS. The HMSA module aligns multiscale visual features with the text semantic vector, enabling precise multimodal matching.
arXiv Detail & Related papers (2025-04-30T02:14:39Z) - EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery [15.581788175591097]
It is challenging to adapt natural spatial models to remote sensing imagery. EarthGPT-X offers zoom-in and zoom-out insight and possesses flexible multi-grained interactive abilities. Experiments demonstrate the superiority of the proposed EarthGPT-X in multi-grained tasks.
arXiv Detail & Related papers (2025-04-17T09:56:35Z) - Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception [10.377899615199278]
High-quality image captions play a crucial role in improving the performance of cross-modal applications. Recent studies have employed multimodal large language models (MLLMs) to generate captions. However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations.
arXiv Detail & Related papers (2025-04-09T08:07:46Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
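A hedged sketch of what compressing visual tokens at different granularities might look like is given below; the grid size and pooling levels are assumptions, not the MME's actual design.

```python
# Hedged sketch of multi-granularity visual-token compression (Matryoshka-style);
# grid size and pooling levels are assumptions.
import torch
import torch.nn.functional as F

def multi_granularity_tokens(v_tok: torch.Tensor, grid: int = 24, levels=(1, 2, 4)):
    """v_tok: (B, grid*grid, D) visual tokens. Returns pooling factor -> pooled tokens,
    so downstream retrieval can choose how many visual tokens to spend per image."""
    B, N, D = v_tok.shape
    fmap = v_tok.transpose(1, 2).reshape(B, D, grid, grid)
    out = {}
    for f in levels:
        pooled = F.avg_pool2d(fmap, f) if f > 1 else fmap
        out[f] = pooled.flatten(2).transpose(1, 2)        # (B, (grid // f) ** 2, D)
    return out
```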
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness [34.170341753045776]
We introduce DLaVA, a novel method that enhances MLLMs with answer localization capabilities for Document VQA. We present both OCR-dependent and OCR-free architectures, with the OCR-free approach eliminating the need for separate text recognition components. Our contributions include enhancing interpretability and reliability by grounding responses in spatially annotated visual content.
arXiv Detail & Related papers (2024-11-29T06:17:11Z) - ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration [33.675976247869016]
Zoom Eye conceptualizes an image as a tree, with each child node representing a zoomed sub-patch of its parent node and the root node representing the overall image.
We show that Zoom Eye not only consistently improves the performance of a series of base MLLMs by a large margin (e.g., LLaVA-v1.5-7B increases by 34.57% on $V^*$ Bench and 17.88% on HR-Bench), but also enables small 7B MLLMs to outperform strong large models such as GPT-4o.
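A minimal version of the image tree described above could look like the following; the 2x2 split and the node fields are illustrative assumptions, since the paper's actual expansion policy and search procedure are not detailed here.

```python
# A minimal image tree in the spirit of Zoom Eye: the root covers the whole image
# and each child is a zoomed sub-patch of its parent (the 2x2 split is an assumption).
from dataclasses import dataclass, field
from PIL import Image

@dataclass
class ZoomNode:
    box: tuple                                      # (x1, y1, x2, y2) in original-image coordinates
    children: list = field(default_factory=list)

def expand(node: ZoomNode) -> None:
    """Split a node's region into a 2x2 grid of zoomed sub-patches."""
    x1, y1, x2, y2 = node.box
    mx, my = (x1 + x2) // 2, (y1 + y2) // 2
    for bx in ((x1, y1, mx, my), (mx, y1, x2, my), (x1, my, mx, y2), (mx, my, x2, y2)):
        node.children.append(ZoomNode(bx))

image = Image.open("scene.jpg")                     # placeholder path
root = ZoomNode((0, 0, *image.size))                # root node represents the overall image
expand(root)
patch = image.crop(root.children[0].box)            # high-resolution view of one sub-patch
```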
arXiv Detail & Related papers (2024-11-25T02:15:30Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - TokenPacker: Efficient Visual Projector for Multimodal LLM [37.1071749188282]
The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM).
We propose a novel visual projector that adopts a coarse-to-fine scheme to inject enriched fine-grained characteristics and generate condensed visual tokens.
Our approach compresses the visual tokens by 75% to 89% while achieving comparable or even better performance across diverse benchmarks.
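The coarse-to-fine idea can be sketched as pooling the fine visual-token grid into coarse queries that then re-attend to the fine tokens; the layer sizes and the single attention layer below are assumptions, not TokenPacker's actual architecture.

```python
# Sketch of a coarse-to-fine projector: pool fine visual tokens into coarse queries,
# then let each query attend back to the fine tokens. Sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineProjector(nn.Module):
    def __init__(self, dim: int = 1024, grid: int = 24, stride: int = 2):
        super().__init__()
        self.grid, self.stride = grid, stride       # stride 2 keeps 1/4 of the tokens (75% compression)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, v_tok: torch.Tensor) -> torch.Tensor:
        """v_tok: (B, grid*grid, dim) fine visual tokens -> (B, (grid // stride)**2, dim)."""
        B, N, D = v_tok.shape
        fmap = v_tok.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        coarse = F.avg_pool2d(fmap, self.stride)    # coarse queries via average pooling
        q = coarse.flatten(2).transpose(1, 2)
        condensed, _ = self.attn(q, v_tok, v_tok)   # inject fine-grained detail into the queries
        return condensed

# Example: 576 fine tokens (24x24) are condensed to 144 tokens (12x12).
# projector = CoarseToFineProjector(); out = projector(torch.randn(1, 576, 1024))
```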
arXiv Detail & Related papers (2024-07-02T16:10:55Z) - Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement [59.66539728681453]
Scene text image super-resolution (STISR) aims to improve image quality while boosting downstream scene text recognition accuracy.
Most existing methods treat the foreground (character regions) and background (non-character regions) equally in the forward process.
We propose a novel method LEMMA that explicitly models character regions to produce high-level text-specific guidance for super-resolution.
arXiv Detail & Related papers (2023-07-19T05:08:47Z) - Boosting Few-shot Fine-grained Recognition with Background Suppression and Foreground Alignment [53.401889855278704]
Few-shot fine-grained recognition (FS-FGR) aims to recognize novel fine-grained categories with the help of limited available samples.
We propose a two-stage background suppression and foreground alignment framework, which is composed of a background activation suppression (BAS) module, a foreground object alignment (FOA) module, and a local to local (L2L) similarity metric.
Experiments conducted on multiple popular fine-grained benchmarks demonstrate that our method outperforms the existing state-of-the-art by a large margin.
arXiv Detail & Related papers (2022-10-04T07:54:40Z) - High-resolution Depth Maps Imaging via Attention-based Hierarchical Multi-modal Fusion [84.24973877109181]
We propose a novel attention-based hierarchical multi-modal fusion network for guided depth super-resolution (DSR).
We show that our approach outperforms state-of-the-art methods in terms of reconstruction accuracy, running speed and memory efficiency.
arXiv Detail & Related papers (2021-04-04T03:28:33Z)