HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling
- URL: http://arxiv.org/abs/2510.00054v1
- Date: Sun, 28 Sep 2025 08:31:48 GMT
- Title: HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling
- Authors: Xianjie Liu, Yiman Hu, Yixiong Zou, Liang Wu, Jian Xu, Bo Zheng
- Abstract summary: HiDe is a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens. It reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. After optimization, HiDe uses 75% less memory than the previous training-free approach.
- Score: 22.105148012267005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks. However, their performance on high-resolution images remains suboptimal. Existing approaches often attribute this limitation to perceptual constraints, arguing that MLLMs struggle to recognize small objects and therefore adopting "zoom in" strategies for better detail. Our analysis reveals a different cause: the main issue is not object size but complex background interference. We systematically analyze this "zoom in" operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens, identify the key information tokens, and leverage their attention weights to achieve precise alignment with the target visual regions. It then employs Layout-Preserving Decoupling (LPD) to decouple these regions from the background and reconstruct a compact representation that preserves essential spatial layouts while eliminating background interference. HiDe sets a new SOTA on V*Bench, HRBench4K, and HRBench8K, boosting Qwen2.5-VL 7B and InternVL3 8B to 92.1% and 91.6% on V*Bench, even surpassing RL methods. After optimization, HiDe uses 75% less memory than the previous training-free approach. Code is available at https://github.com/Tennine2077/HiDe.
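For intuition, the sketch below illustrates the two-stage idea in the abstract: an attention-driven selection of query-relevant visual patches (in the spirit of TAD) followed by a layout-preserving compaction that drops background rows and columns (in the spirit of LPD). This is a minimal conceptual sketch, not the authors' implementation (see the linked repository for that): the question-to-patch attention is simulated, and every function name, weighting heuristic, and threshold is an illustrative assumption.

```python
# Conceptual sketch of HiDe's two stages, assuming access to
# question-token -> visual-patch cross-attention weights from an MLLM.
# Simulated data; all names and heuristics are illustrative assumptions.
import numpy as np

def select_key_regions(attn, grid_hw, keep_ratio=0.1):
    """TAD-style step (sketch): pool attention from question tokens over the
    patch grid and keep the most-attended patches.

    attn: (num_question_tokens, num_patches) attention weights.
    grid_hw: (H, W) patch-grid shape with H * W == num_patches.
    Returns a boolean (H, W) mask over patches.
    """
    # Weight each question token by how peaked its attention map is, so that
    # content-bearing "key information" tokens dominate over function words
    # (a stand-in heuristic; the paper derives the key tokens via TAD).
    peakedness = attn.max(axis=1) / (attn.mean(axis=1) + 1e-8)
    pooled = (peakedness[:, None] * attn).sum(axis=0)
    k = max(1, int(keep_ratio * pooled.size))
    threshold = np.partition(pooled, -k)[-k]
    return (pooled >= threshold).reshape(grid_hw)

def compact_layout(mask):
    """LPD-style step (sketch): drop all-background rows/columns while
    preserving the relative spatial layout of the selected patches."""
    rows, cols = mask.any(axis=1), mask.any(axis=0)
    return mask[np.ix_(rows, cols)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, W = 16, 16
    attn = rng.random((8, H * W))              # fake question->patch attention
    attn /= attn.sum(axis=1, keepdims=True)    # normalize per question token
    mask = select_key_regions(attn, (H, W), keep_ratio=0.05)
    compact = compact_layout(mask)
    print(f"kept {mask.sum()}/{mask.size} patches; compact grid {compact.shape}")
```

In the full method, the compacted regions themselves (not just a mask) would be re-encoded and fed to the MLLM in place of the raw high-resolution image; the sketch shows only the selection and layout-compaction logic.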
Related papers
- GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery [69.05066425853326]
"thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools.<n>This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny.<n>We propose GeoEyes, a training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom
arXiv Detail & Related papers (2026-02-15T15:50:55Z) - FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning [62.11389260206383]
FineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects. We present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation of subtle, small-scale targets.
arXiv Detail & Related papers (2025-10-24T10:14:17Z) - Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping [43.14520214157644]
AttWarp is a lightweight method that allocates more resolution to query-relevant content while compressing less informative areas. At test time, the approach uses an MLLM's cross-modal attention to perform rectilinear warping of the input image. This attention-guided warping preserves all original image information but redistributes it non-uniformly.
arXiv Detail & Related papers (2025-10-10T17:57:06Z) - Zoom-Refine: Boosting High-Resolution Multimodal Understanding via Localized Zoom and Self-Refinement [17.824841346088903]
Multimodal Large Language Models (MLLMs) often struggle to interpret high-resolution images accurately. We introduce Zoom-Refine, a novel training-free method that enhances MLLM capabilities to address this issue. Our method harnesses the MLLM's inherent capabilities for spatial localization, contextual reasoning, and comparative analysis without requiring additional training or external experts.
arXiv Detail & Related papers (2025-06-02T13:32:35Z) - Efficiently Disentangling CLIP for Multi-Object Perception [62.523137132812764]
Vision-language models like CLIP excel at recognizing the single, prominent object in a scene, but struggle in complex scenes containing multiple objects. We propose DCLIP, an efficient framework that learns an optimal level of mutual information while adding only minimal learnable parameters to a frozen VLM.
arXiv Detail & Related papers (2025-02-05T08:20:31Z) - RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs [38.34856927170692]
We propose a training-free framework for analyzing trained Multimodal Large Language Models (MLLMs). It consists of Probe-Activated Dynamic FFN and Hollow Attention, which enable adjustable reductions in computation for visual tokens. Experiments demonstrate substantial, structured, and clustered redundancy unique to decoder-only MLLMs.
arXiv Detail & Related papers (2025-01-31T11:09:16Z) - Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid [87.09900996643516]
We introduce a Complementary Image Pyramid (CIP) to mitigate semantic discontinuity during high-resolution image processing.
We also introduce a Scale Compression Mechanism (SCM) to reduce the additional computational overhead by compressing the redundant visual tokens.
Our experiments demonstrate that CIP can consistently enhance the performance across diverse architectures.
arXiv Detail & Related papers (2024-08-04T13:55:58Z) - Pink: Unveiling the Power of Referential Comprehension for Multi-modal
LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z) - SufrinNet: Toward Sufficient Cross-View Interaction for Stereo Image
Enhancement in The Dark [119.01585302856103]
Low-light stereo image enhancement (LLSIE) is a relatively new task to enhance the quality of visually unpleasant stereo images captured in dark conditions.
Current methods clearly suffer from two shortcomings: 1) insufficient cross-view interaction; and 2) a lack of long-range dependency for intra-view learning.
We propose a novel LLSIE model, termed Sufficient Cross-View Interaction Network (SufrinNet).
arXiv Detail & Related papers (2022-11-02T04:01:30Z) - Boosting Few-shot Fine-grained Recognition with Background Suppression
and Foreground Alignment [53.401889855278704]
Few-shot fine-grained recognition (FS-FGR) aims to recognize novel fine-grained categories with the help of limited available samples.
We propose a two-stage background suppression and foreground alignment framework, which is composed of a background activation suppression (BAS) module, a foreground object alignment (FOA) module, and a local-to-local (L2L) similarity metric.
Experiments conducted on multiple popular fine-grained benchmarks demonstrate that our method outperforms the existing state-of-the-art by a large margin.
arXiv Detail & Related papers (2022-10-04T07:54:40Z) - Multi-View Stereo Network with attention thin volume [0.0]
We propose an efficient multi-view stereo (MVS) network for inferring depth values from multiple RGB images.
We introduce a self-attention mechanism to fully aggregate the dominant information from the input images.
We also introduce group-wise correlation into feature aggregation, which greatly reduces the memory and computation burden.
arXiv Detail & Related papers (2021-10-16T11:51:23Z)