Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning
- URL: http://arxiv.org/abs/2602.23615v1
- Date: Fri, 27 Feb 2026 02:43:35 GMT
- Title: Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning
- Authors: Jiacheng Yang, Anqi Chen, Yunkai Dang, Qi Fan, Cong Wang, Wenbin Li, Feng Miao, Yang Gao
- Abstract summary: A common practice is to identify key image regions and refer to their high-resolution counterparts during reasoning. It remains an open question how to enhance a model's grounding abilities to support reasoning without relying on additional annotations. We propose the High-resolution Annotation-free Reasoning Technique (HART), a closed-loop framework that enables LMMs to focus on and self-verify key regions.
- Score: 17.81009868725361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current Large Multimodal Models (LMMs) struggle with high-resolution visual inputs during the reasoning process, as the number of image tokens increases quadratically with resolution, introducing substantial redundancy and irrelevant information. A common practice is to identify key image regions and refer to their high-resolution counterparts during reasoning, typically trained with external visual supervision. However, such visual supervision cues require costly grounding labels from human annotators. Meanwhile, it remains an open question how to enhance a model's grounding abilities to support reasoning without relying on additional annotations. In this paper, we propose High-resolution Annotation-free Reasoning Technique (HART), a closed-loop framework that enables LMMs to focus on and self-verify key regions of high-resolution visual inputs. HART incorporates a post-training paradigm in which we design Advantage Preference Group Relative Policy Optimization (AP-GRPO) to encourage accurate localization of key regions. Notably, HART provides explainable reasoning pathways and enables efficient optimization of localization. Extensive experiments demonstrate that HART improves performance across a wide range of high-resolution visual tasks, consistently outperforming strong baselines. When applied to post-train Qwen2.5-VL-7B, HART even surpasses larger-scale models such as Qwen2.5-VL-72B and LLaVA-OneVision-72B on high-resolution, vision-centric benchmarks.
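The abstract does not spell out AP-GRPO's formulation, so the following is only a minimal sketch of a GRPO-style, critic-free advantage computation with an added localization-preference term. The function name `ap_grpo_advantages`, the weight `alpha`, and the self-verification reward are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a group-relative advantage computation with an added
# localization-preference term, in the spirit of AP-GRPO as described in
# the abstract. The reward mix and all names are assumptions.
import numpy as np

def ap_grpo_advantages(answer_rewards, region_rewards, alpha=0.5, eps=1e-8):
    """Group-relative advantages for one prompt's sampled rollouts.

    answer_rewards: reward for final-answer correctness (annotation-free,
                    e.g. exact match against a verifiable answer).
    region_rewards: self-verification reward for the model's chosen crop
                    (e.g. does re-answering from the crop agree?).
    """
    r = np.asarray(answer_rewards) + alpha * np.asarray(region_rewards)
    # Standard GRPO step: normalize rewards within the sampled group so
    # above-average rollouts get positive advantage, with no critic.
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 rollouts; the 2nd both answers correctly and picks a crop
# that survives self-verification, so it gets the largest advantage.
print(ap_grpo_advantages([0, 1, 0, 1], [0, 1, 0, 0]))
```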
Related papers
- Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage [4.771792258699647]
We propose Head-Aware Visual Cropping (HAVC), a training-free method that improves visual grounding by leveraging a selectively refined subset of attention heads. Experiments on multiple fine-grained VQA benchmarks demonstrate that HAVC consistently outperforms state-of-the-art cropping strategies.
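As a rough illustration of attention-guided cropping (not HAVC's actual procedure; the head-selection rule, which is the paper's contribution, is stubbed out here as a given index list), one could aggregate attention from selected heads into a saliency map and crop the most-attended region:

```python
# Hedged sketch: average text-to-patch attention over a chosen subset of
# heads, then crop a box around the top-scoring patches.
import numpy as np

def crop_from_attention(attn, selected_heads, grid=24, keep_frac=0.1):
    """attn: (heads, grid*grid) attention; returns (r0, c0, r1, c1)
    in patch coordinates covering the top-k salient patches."""
    sal = attn[selected_heads].mean(axis=0).reshape(grid, grid)
    k = max(1, int(keep_frac * grid * grid))
    idx = np.argsort(sal.ravel())[-k:]               # top-k salient patches
    rows, cols = np.unravel_index(idx, (grid, grid))
    return rows.min(), cols.min(), rows.max() + 1, cols.max() + 1

attn = np.random.rand(16, 24 * 24)
print(crop_from_attention(attn, selected_heads=[3, 7, 11]))
```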
arXiv Detail & Related papers (2026-01-30T02:46:55Z)
- Zooming into Comics: Region-Aware RL Improves Fine-Grained Comic Understanding in Vision-Language Models [23.954335269506576]
Complex visual narratives, such as comics, present a significant challenge to Vision-Language Models (VLMs). We introduce AI4VA-FG, the first fine-grained and comprehensive benchmark for VLM-based comic understanding. We also evaluate state-of-the-art proprietary models, including GPT-4o and Gemini-2.5, and open-source models such as Qwen2.5-VL.
arXiv Detail & Related papers (2025-11-09T18:27:45Z)
- Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance [60.028070589466445]
Pyramid Token Pruning (PTP) is a training-free strategy that hierarchically integrates bottom-up visual saliency at both region and token levels with top-down instruction-guided relevance. We show that PTP substantially reduces computational cost, memory usage, and inference latency, with negligible performance degradation.
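A minimal sketch of the general idea: combine a bottom-up saliency score with top-down instruction relevance under a fixed token budget. The product-based combination rule below is an assumption; PTP's actual hierarchical region-then-token scheme is more involved.

```python
# Hedged sketch of instruction-guided visual token pruning.
import numpy as np

def prune_tokens(tokens, saliency, instr_emb, budget):
    """tokens: (N, d) visual tokens; saliency: (N,) bottom-up scores;
    instr_emb: (d,) pooled instruction embedding; returns kept indices."""
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    q = instr_emb / (np.linalg.norm(instr_emb) + 1e-8)
    relevance = t @ q                       # top-down, instruction-guided
    score = saliency * np.maximum(relevance, 0.0)
    return np.argsort(score)[-budget:]      # keep highest-scoring tokens

toks = np.random.randn(576, 64)
kept = prune_tokens(toks, np.random.rand(576), np.random.randn(64), budget=144)
print(kept.shape)  # (144,)
```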
arXiv Detail & Related papers (2025-09-19T07:28:17Z)
- HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models [60.028070589466445]
We propose HERO, a framework that integrates content-adaptive token budget allocation with function-aware token selection. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
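As a loose illustration of content-adaptive budget allocation only (the complexity proxy and the proportional rule below are assumptions, not HERO's actual criteria):

```python
# Hedged sketch: split a global token budget across high-resolution
# sub-images in proportion to a cheap content-complexity proxy.
import numpy as np

def allocate_budgets(subimage_feats, total_budget, min_tokens=8):
    """subimage_feats: list of (n_i, d) token arrays, one per sub-image.
    Note: per-image floors mean budgets need not sum exactly to total."""
    # Proxy for content richness: mean per-dimension feature variance.
    complexity = np.array([f.var(axis=0).mean() for f in subimage_feats])
    share = complexity / complexity.sum()
    return np.maximum(min_tokens, (share * total_budget).astype(int))

feats = [np.random.randn(144, 32) * s for s in (0.3, 1.0, 2.0)]
print(allocate_budgets(feats, total_budget=256))  # richer crops get more
```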
arXiv Detail & Related papers (2025-09-16T13:22:08Z)
- Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding [31.57375084036447]
Vision Language Models (VLMs) have recently achieved significant progress in bridging visual perception and linguistic reasoning. We propose LASER, a self-evolving framework that progressively endows VLMs with multi-step perception capabilities. Our approach integrates Monte Carlo quality estimation with Intersection-over-Union (IoU)-based region quality evaluation to jointly encourage both accuracy and diversity in constructing high-quality preference data.
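A hedged sketch of the IoU-scored preference-pair construction step (the `margin` threshold and the pairing rule are illustrative; the Monte Carlo quality-estimation side is omitted):

```python
# Sketch: score sampled grounding predictions by IoU against a
# pseudo-target, then form (chosen, rejected) preference pairs.
def iou(a, b):
    """a, b: boxes as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter)
    return inter / union if union > 0 else 0.0

def preference_pairs(samples, target, margin=0.2):
    """Pair any two samples whose IoU scores differ by at least `margin`,
    better one first; such pairs can feed preference optimization."""
    scored = sorted(samples, key=lambda s: iou(s, target), reverse=True)
    return [(w, l) for i, w in enumerate(scored) for l in scored[i+1:]
            if iou(w, target) - iou(l, target) >= margin]

boxes = [(0, 0, 10, 10), (2, 2, 12, 12), (50, 50, 60, 60)]
print(preference_pairs(boxes, target=(1, 1, 11, 11)))
```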
arXiv Detail & Related papers (2025-09-04T14:17:01Z)
- Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL [70.1326027641056]
Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. We propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions. We present a two-stage training pipeline, including supervised fine-tuning and reinforcement learning.
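A minimal sketch of the zoom step such a method needs, assuming the model emits a normalized bounding box; everything around the crop (the VLM calls, the RL loop) is omitted:

```python
# Sketch: crop a model-predicted box (normalized coordinates) and
# upsample it so the region can be re-encoded at higher effective
# resolution on the next reasoning step.
from PIL import Image

def zoom(image, box_norm, out_size=(448, 448)):
    """box_norm: (x0, y0, x1, y1) in [0, 1] relative coordinates."""
    w, h = image.size
    x0, y0, x1, y1 = box_norm
    crop = image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))
    return crop.resize(out_size, Image.BICUBIC)

img = Image.new("RGB", (4032, 3024))           # stand-in high-res photo
detail = zoom(img, (0.4, 0.4, 0.6, 0.6))       # re-encode this crop next
print(detail.size)                              # (448, 448)
```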
arXiv Detail & Related papers (2025-05-21T12:18:15Z)
- Global Semantic-Guided Sub-image Feature Weight Allocation in High-Resolution Large Vision-Language Models [50.98559225639266]
Sub-images with higher semantic relevance to the entire image encapsulate richer visual information for preserving the model's visual understanding ability. The Global Semantic-guided Weight Allocator (GSWA) module allocates weights to sub-images based on their relative information density. SleighVL, a lightweight yet high-performing model, outperforms models with comparable parameters and remains competitive with larger models.
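One plausible reading of semantic-guided weighting, sketched under assumptions (cosine similarity against the global thumbnail feature and a softmax temperature are choices made here, not GSWA's stated design):

```python
# Sketch: weight each sub-image by how similar its pooled features are
# to the global image feature, normalized with a softmax.
import numpy as np

def subimage_weights(sub_feats, global_feat, temp=0.1):
    """sub_feats: (k, d) pooled sub-image features; global_feat: (d,)."""
    s = sub_feats / np.linalg.norm(sub_feats, axis=1, keepdims=True)
    g = global_feat / np.linalg.norm(global_feat)
    sims = s @ g
    e = np.exp((sims - sims.max()) / temp)      # numerically stable softmax
    return e / e.sum()

w = subimage_weights(np.random.randn(6, 128), np.random.randn(128))
print(w.round(3), w.sum())                      # weights sum to 1
```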
arXiv Detail & Related papers (2025-01-24T06:42:06Z)
- Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation. We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations. We build HRVMamba, a novel model for efficient high-resolution representation learning.
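The DVSS block is not reproduced here; the sketch below only illustrates the multi-scale depthwise-convolution idea mentioned in the summary, with the selective-scan (state space) path omitted entirely:

```python
# Sketch: a residual branch of depthwise convolutions at several kernel
# sizes, fused by a 1x1 conv. Stands in for the multi-scale convolutional
# augmentation; it is NOT the full DVSS block.
import torch
import torch.nn as nn

class MultiScaleConvBranch(nn.Module):
    def __init__(self, channels, scales=(3, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in scales)                    # depthwise, one per scale
        self.fuse = nn.Conv2d(channels * len(scales), channels, 1)

    def forward(self, x):                       # x: (B, C, H, W)
        return x + self.fuse(torch.cat([c(x) for c in self.convs], dim=1))

block = MultiScaleConvBranch(64)
print(block(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```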
arXiv Detail & Related papers (2024-10-04T06:19:29Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
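The summary describes an interactive, two-pass procedure; below is a hedged sketch under assumed interfaces, where `vlm_generate` is a hypothetical stand-in for whatever LVLM call is actually used:

```python
# Sketch: first ask the model where to look, then answer from that crop.
from PIL import Image

def vlm_generate(image, prompt):
    """Hypothetical placeholder for an LVLM call; a real model would
    answer the prompt. This stub always returns the same ROI string."""
    return "0.25,0.30,0.55,0.70"

def chain_of_spot(image, question):
    roi_text = vlm_generate(
        image, f"Where should I look to answer: {question}?"
               " Reply as x0,y0,x1,y1 in [0,1].")
    x0, y0, x1, y1 = map(float, roi_text.split(","))
    w, h = image.size
    crop = image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))
    # Second pass answers conditioned on the zoomed region of interest.
    return vlm_generate(crop, question)

print(chain_of_spot(Image.new("RGB", (1024, 768)), "What is on the sign?"))
```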
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Improving Vision-and-Language Reasoning via Spatial Relations Modeling [30.477235227733928]
Visual commonsense reasoning (VCR) is a challenging multi-modal task.
The proposed method guides the representations to retain more spatial context.
We achieve state-of-the-art results on VCR and on two other vision-and-language reasoning tasks, VQA and NLVR.
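As a concrete example of the kind of geometry such an objective can supervise, here is a small sketch of pairwise spatial-relation features between two boxes; the exact relation set is an assumption, not the paper's definition:

```python
# Sketch: translation- and scale-aware relation features between two
# object boxes, the kind of signal a spatial-relations head could use.
import numpy as np

def spatial_relation(a, b):
    """a, b: boxes (x0, y0, x1, y1). Returns normalized center offsets,
    log size ratios, and an area ratio."""
    acx, acy = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bcx, bcy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    aw, ah = a[2] - a[0], a[3] - a[1]
    bw, bh = b[2] - b[0], b[3] - b[1]
    return np.array([(bcx - acx) / aw, (bcy - acy) / ah,
                     np.log(bw / aw), np.log(bh / ah),
                     (bw * bh) / (aw * ah)])

print(spatial_relation((0, 0, 10, 10), (5, 5, 25, 25)))
```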
arXiv Detail & Related papers (2023-11-09T11:54:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.