Zoom-Refine: Boosting High-Resolution Multimodal Understanding via Localized Zoom and Self-Refinement
- URL: http://arxiv.org/abs/2506.01663v2
- Date: Mon, 11 Aug 2025 07:25:46 GMT
- Title: Zoom-Refine: Boosting High-Resolution Multimodal Understanding via Localized Zoom and Self-Refinement
- Authors: Xuan Yu, Dayan Guan, Yanfeng Gu
- Abstract summary: Multimodal Large Language Models (MLLM) often struggle to interpret high-resolution images accurately. We introduce Zoom-Refine, a novel training-free method that enhances MLLM capabilities to address this issue. Our method harnesses the MLLM's inherent capabilities for spatial localization, contextual reasoning and comparative analysis without requiring additional training or external experts.
- Score: 17.824841346088903
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multimodal Large Language Models (MLLM) often struggle to interpret high-resolution images accurately, where fine-grained details are crucial for complex visual understanding. We introduce Zoom-Refine, a novel training-free method that enhances MLLM capabilities to address this issue. Zoom-Refine operates through a synergistic process of Localized Zoom and Self-Refinement. In the Localized Zoom step, Zoom-Refine leverages the MLLM to provide a preliminary response to an input query and identifies the most task-relevant image region by predicting its bounding box coordinates. During the Self-Refinement step, Zoom-Refine then integrates fine-grained details from the high-resolution crop (identified by Localized Zoom) with its initial reasoning to re-evaluate and refine its preliminary response. Our method harnesses the MLLM's inherent capabilities for spatial localization, contextual reasoning and comparative analysis without requiring additional training or external experts. Comprehensive experiments demonstrate the efficacy of Zoom-Refine on two challenging high-resolution multimodal benchmarks. Code is available at https://github.com/xavier-yu114/Zoom-Refine
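The two-step procedure above lends itself to a compact sketch. The snippet below is a minimal, illustrative rendering of the described pipeline, not the authors' released code: the `mllm.generate(images, prompt)` interface, the prompt wording, and the `parse_box` helper are all assumptions.

```python
# Minimal sketch of the Zoom-Refine pipeline (Localized Zoom + Self-Refinement).
# `mllm` is a hypothetical model wrapper exposing generate(images, prompt) -> str;
# the prompts and the box parser below are illustrative assumptions.
import re
from PIL import Image

def parse_box(text, size):
    """Take the first four numbers in the response as [x1, y1, x2, y2], clamped to the image."""
    x1, y1, x2, y2 = [float(n) for n in re.findall(r"-?\d+\.?\d*", text)[:4]]
    w, h = size
    return (max(0, int(x1)), max(0, int(y1)), min(w, int(x2)), min(h, int(y2)))

def zoom_refine(mllm, image: Image.Image, question: str) -> str:
    # Localized Zoom: preliminary answer plus the most task-relevant region.
    prelim = mllm.generate([image], f"{question}\nGive a brief preliminary answer.")
    box_text = mllm.generate(
        [image],
        f"Question: {question}\nReturn the bounding box [x1, y1, x2, y2] of the image "
        "region most relevant to answering the question.",
    )
    crop = image.crop(parse_box(box_text, image.size))

    # Self-Refinement: re-evaluate the preliminary answer using the high-resolution crop.
    return mllm.generate(
        [image, crop],
        f"Question: {question}\nPreliminary answer: {prelim}\n"
        "The second image is a zoomed-in crop of the most relevant region. "
        "Re-examine its fine-grained details and give a final, refined answer.",
    )
```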
Related papers
- GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery [69.05066425853326]
"thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools.<n>This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny.<n>We propose GeoEyes, a training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom
arXiv Detail & Related papers (2026-02-15T15:50:55Z) - Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception [43.08943307183693]
Region-to-Image Distillation transforms zooming from an inference-time tool into a training-time primitive. We show that our models achieve leading performance across multiple fine-grained perception benchmarks.
arXiv Detail & Related papers (2026-02-12T12:00:35Z) - GRASP: Guided Region-Aware Sparse Prompting for Adapting MLLMs to Remote Sensing [50.961694646995376]
We propose a parameter-efficient fine-tuning (PEFT) strategy called Guided Region-Aware Sparse Prompting (GRASP). GRASP introduces spatially structured soft prompts associated with spatial blocks extracted from a frozen visual token grid. Experiments on multiple RSVQA benchmarks show that GRASP achieves competitive performance compared to existing fine-tuning and prompt-based methods.
arXiv Detail & Related papers (2026-01-23T10:12:59Z) - ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks [49.99788276124186]
Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm. We present LRS-GRO, a large-scale benchmark dataset tailored for active perception in UHR RS processing. We propose ZoomEarth, an adaptive cropping-zooming framework with a novel Region-Guided reward that provides fine-grained guidance.
arXiv Detail & Related papers (2025-11-15T15:47:46Z) - FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning [62.11389260206383]
FineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects. We present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation of subtle, small-scale targets.
arXiv Detail & Related papers (2025-10-24T10:14:17Z) - Spatial Preference Rewarding for MLLMs Spatial Understanding [92.25703021388142]
Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities. We propose a Spatial Preference Rewarding (SPR) approach that enhances MLLMs' spatial capabilities.
arXiv Detail & Related papers (2025-10-16T07:16:18Z) - HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling [22.105148012267005]
HiDe is a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens. It reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. After optimization, HiDe uses 75% less memory than the previous training-free approach.
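A rough sketch of the kind of attention-based token selection that Token-wise Attention Decoupling suggests is shown below; the shapes, the averaging over question tokens, and the keep ratio are assumptions rather than the paper's exact formulation.

```python
# Illustrative attention-based selection of question-relevant visual tokens
# (in the spirit of Token-wise Attention Decoupling; details are assumed).
import torch

def select_key_visual_tokens(q_tok: torch.Tensor, v_tok: torch.Tensor, keep_ratio: float = 0.25):
    """q_tok: (Nq, d) question-token features; v_tok: (Nv, d) visual tokens.
    Returns the visual tokens most attended to by the question, in their original order."""
    scores = torch.softmax(q_tok @ v_tok.T / v_tok.shape[-1] ** 0.5, dim=-1)  # (Nq, Nv)
    relevance = scores.mean(dim=0)                       # aggregate attention over question tokens
    k = max(1, int(keep_ratio * v_tok.shape[0]))
    keep = relevance.topk(k).indices.sort().values       # keep top-k tokens, preserve spatial layout
    return v_tok[keep], keep
```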
arXiv Detail & Related papers (2025-09-28T08:31:48Z) - Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment [51.99765487172328]
Chain-of-Zoom (CoZ) is a framework that factorizes single-image super-resolution (SISR) into a chain of intermediate scale-states with multi-scale-aware prompts. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity.
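The scale-autoregressive loop summarized above can be pictured in a few lines; `sr_model` and `vlm_describe` are placeholders for the diffusion SR model and the prompt-generating VLM, and the fixed four-step schedule is only an assumption matching a 4x-per-step, roughly 256x-overall chain.

```python
# Schematic of the Chain-of-Zoom loop: repeated 4x super-resolution, each step
# guided by a VLM-generated text prompt. `sr_model` and `vlm_describe` are
# placeholders, not the paper's actual components.
def chain_of_zoom(image, sr_model, vlm_describe, steps: int = 4):
    current = image
    for _ in range(steps):                          # four 4x steps give a nominal 256x chain
        prompt = vlm_describe(current)              # scale-aware text cue for the current state
        current = sr_model(current, prompt=prompt)  # one 4x super-resolution step
    return current
```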
arXiv Detail & Related papers (2025-05-24T08:50:08Z) - VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought [51.43082554363725]
We introduce VLM-R$^3$ (Visual Language Model with Region Recognition, Reasoning, and Refinement), a framework that equips an MLLM with the ability to decide when additional visual evidence is needed. Experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R$^3$ sets a new state of the art.
arXiv Detail & Related papers (2025-05-22T03:50:13Z) - XeMap: Contextual Referring in Large-Scale Remote Sensing Environments [13.162347922111056]
The XeMap task focuses on contextual, fine-grained localization of text-referred regions in large-scale RS scenes. XeMap-Network handles the complexities of pixel-level cross-modal contextual referring mapping in RS. The HMSA module aligns multiscale visual features with the text semantic vector, enabling precise multimodal matching.
arXiv Detail & Related papers (2025-04-30T02:14:39Z) - EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery [15.581788175591097]
It is challenging to adapt natural spatial models to remote sensing imagery. EarthGPT-X offers zoom-in and zoom-out insight and possesses flexible multi-grained interactive abilities. Experiments demonstrate the superiority of the proposed EarthGPT-X in multi-grained tasks.
arXiv Detail & Related papers (2025-04-17T09:56:35Z) - Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception [10.377899615199278]
High-quality image captions play a crucial role in improving the performance of cross-modal applications. Recent studies have employed multimodal large language models (MLLMs) to generate captions. However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations.
arXiv Detail & Related papers (2025-04-09T08:07:46Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
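A hedged sketch of what compressing visual tokens at different granularities might look like is given below; the grid size and pooling levels are assumptions, not the MME's actual design.

```python
# Hedged sketch of multi-granularity visual-token compression (Matryoshka-style);
# grid size and pooling levels are assumptions.
import torch
import torch.nn.functional as F

def multi_granularity_tokens(v_tok: torch.Tensor, grid: int = 24, levels=(1, 2, 4)):
    """v_tok: (B, grid*grid, D) visual tokens. Returns pooling factor -> pooled tokens,
    so downstream retrieval can choose how many visual tokens to spend per image."""
    B, N, D = v_tok.shape
    fmap = v_tok.transpose(1, 2).reshape(B, D, grid, grid)
    out = {}
    for f in levels:
        pooled = F.avg_pool2d(fmap, f) if f > 1 else fmap
        out[f] = pooled.flatten(2).transpose(1, 2)        # (B, (grid // f) ** 2, D)
    return out
```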
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness [34.170341753045776]
We introduce DLaVA, a novel method that enhances MLLMs with answer localization capabilities for Document VQA. We present both OCR-dependent and OCR-free architectures, with the OCR-free approach eliminating the need for separate text recognition components. Our contributions include enhancing interpretability and reliability by grounding responses in spatially annotated visual content.
arXiv Detail & Related papers (2024-11-29T06:17:11Z) - ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration [33.675976247869016]
Zoom Eye conceptualizes an image as a tree, with each child node representing a zoomed sub-patch of its parent node and the root node representing the overall image.
We show that Zoom Eye not only consistently improves the performance of a series of base MLLMs by a large margin (e.g., LLaVA-v1.5-7B increases by 34.57% on $V^*$ Bench and 17.88% on HR-Bench), but also enables small 7B MLLMs to outperform strong large models such as GPT-4o.
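A minimal version of the image tree described above could look like the following; the 2x2 split and the node fields are illustrative assumptions, since the paper's actual expansion policy and search procedure are not detailed here.

```python
# A minimal image tree in the spirit of Zoom Eye: the root covers the whole image
# and each child is a zoomed sub-patch of its parent (the 2x2 split is an assumption).
from dataclasses import dataclass, field
from PIL import Image

@dataclass
class ZoomNode:
    box: tuple                                      # (x1, y1, x2, y2) in original-image coordinates
    children: list = field(default_factory=list)

def expand(node: ZoomNode) -> None:
    """Split a node's region into a 2x2 grid of zoomed sub-patches."""
    x1, y1, x2, y2 = node.box
    mx, my = (x1 + x2) // 2, (y1 + y2) // 2
    for bx in ((x1, y1, mx, my), (mx, y1, x2, my), (x1, my, mx, y2), (mx, my, x2, y2)):
        node.children.append(ZoomNode(bx))

image = Image.open("scene.jpg")                     # placeholder path
root = ZoomNode((0, 0, *image.size))                # root node represents the overall image
expand(root)
patch = image.crop(root.children[0].box)            # high-resolution view of one sub-patch
```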
arXiv Detail & Related papers (2024-11-25T02:15:30Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - TokenPacker: Efficient Visual Projector for Multimodal LLM [37.1071749188282]
The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM).
We propose a novel visual projector that adopts a coarse-to-fine scheme to inject enriched fine-grained characteristics and generate condensed visual tokens.
Our approach compresses the visual tokens by 75% to 89% while achieving comparable or even better performance across diverse benchmarks.
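The coarse-to-fine idea can be sketched as pooling the fine visual-token grid into coarse queries that then re-attend to the fine tokens; the layer sizes and the single attention layer below are assumptions, not TokenPacker's actual architecture.

```python
# Sketch of a coarse-to-fine projector: pool fine visual tokens into coarse queries,
# then let each query attend back to the fine tokens. Sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineProjector(nn.Module):
    def __init__(self, dim: int = 1024, grid: int = 24, stride: int = 2):
        super().__init__()
        self.grid, self.stride = grid, stride       # stride 2 keeps 1/4 of the tokens (75% compression)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, v_tok: torch.Tensor) -> torch.Tensor:
        """v_tok: (B, grid*grid, dim) fine visual tokens -> (B, (grid // stride)**2, dim)."""
        B, N, D = v_tok.shape
        fmap = v_tok.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        coarse = F.avg_pool2d(fmap, self.stride)    # coarse queries via average pooling
        q = coarse.flatten(2).transpose(1, 2)
        condensed, _ = self.attn(q, v_tok, v_tok)   # inject fine-grained detail into the queries
        return condensed

# Example: 576 fine tokens (24x24) are condensed to 144 tokens (12x12).
# projector = CoarseToFineProjector(); out = projector(torch.randn(1, 576, 1024))
```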
arXiv Detail & Related papers (2024-07-02T16:10:55Z) - Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement [59.66539728681453]
Scene text image super-resolution (STISR) aims to improve image quality while boosting downstream scene text recognition accuracy.
Most existing methods treat the foreground (character regions) and background (non-character regions) equally in the forward process.
We propose a novel method LEMMA that explicitly models character regions to produce high-level text-specific guidance for super-resolution.
arXiv Detail & Related papers (2023-07-19T05:08:47Z) - Boosting Few-shot Fine-grained Recognition with Background Suppression and Foreground Alignment [53.401889855278704]
Few-shot fine-grained recognition (FS-FGR) aims to recognize novel fine-grained categories with the help of limited available samples.
We propose a two-stage background suppression and foreground alignment framework, which is composed of a background activation suppression (BAS) module, a foreground object alignment (FOA) module, and a local to local (L2L) similarity metric.
Experiments conducted on multiple popular fine-grained benchmarks demonstrate that our method outperforms the existing state-of-the-art by a large margin.
arXiv Detail & Related papers (2022-10-04T07:54:40Z) - High-resolution Depth Maps Imaging via Attention-based Hierarchical Multi-modal Fusion [84.24973877109181]
We propose a novel attention-based hierarchical multi-modal fusion network for guided depth super-resolution (DSR).
We show that our approach outperforms state-of-the-art methods in terms of reconstruction accuracy, running speed and memory efficiency.
arXiv Detail & Related papers (2021-04-04T03:28:33Z)