Related papers: Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

URL: http://arxiv.org/abs/2505.15966v2
Date: Mon, 26 May 2025 03:10:29 GMT
Title: Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
Authors: Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, Wenhu Chen,
Abstract summary: Chain-of-thought reasoning has significantly improved the performance of Large Language Models.<n>We introduce the concept of reasoning in the pixel-space.<n>We demonstrate that this approach significantly improves Vision-Language Models.
Score: 39.66636859076594
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidences, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, \model, achieves 84\% on V* bench, 74\% on TallyQA-Complex, and 84\% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.

Related papers

Cross-Modal Attention Guided Unlearning in Vision-Language Models [16.460281156521646]
Vision-Language Models (VLMs) have demonstrated immense capabilities in multi-modal understanding and inference tasks.<n>VLMs add a layer of complexity to this process, as the visual context in the query may also contain sensitive information in addition to the text.<n>We formulate Cross-Modal Attention Guided Unlearning (CAGUL), a lightweight and efficient VLM unlearning framework.
arXiv Detail & Related papers (2025-10-08T21:21:59Z)
Look Less, Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning [35.475941880366726]
Vision-Language Models (VLMs) excel at many multimodal tasks, yet they frequently struggle with tasks requiring precise understanding and handling of fine-grained visual elements.<n>Recent work has shown promise by incorporating pixel-level visual information into the reasoning process.<n>We propose the first framework for adaptive pixel reasoning that dynamically determines necessary pixel-level operations based on the input query.
arXiv Detail & Related papers (2025-10-02T05:14:52Z)
Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models [63.69856480318313]
AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment.<n>We show that AGILE substantially boosts performance on jigsaw tasks of varying complexity.<n>We also demonstrate strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%.
arXiv Detail & Related papers (2025-10-01T17:58:05Z)
UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning [83.68366772745689]
We propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses.<n>Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference.<n>The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos.
arXiv Detail & Related papers (2025-09-22T17:59:40Z)
HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models [60.028070589466445]
We propose HERO, a framework that integrates content-adaptive token budget allocation with function-aware token selection.<n>This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
arXiv Detail & Related papers (2025-09-16T13:22:08Z)
Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback [33.127607245587576]
We introduce a framework that enables MLLMs to learn complex visual reasoning from only raw images.<n>We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning.<n>The RRVF-trained model not only outperforms existing MLLMs and supervised fine-tuning baselines but also exhibits superior generalization.
arXiv Detail & Related papers (2025-07-28T12:21:19Z)
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space.<n>Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z)
Caption This, Reason That: VLMs Caught in the Middle [3.4820139118440676]
Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years.<n>They still lag behind human capabilities in specific visual tasks such as counting or relational reasoning.<n>We analyze VLM performance along core cognitive axes: Perception, Attention, and Memory.
arXiv Detail & Related papers (2025-05-24T14:25:48Z)
Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guidedd Visual Selection [53.558449071113245]
Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM)<n>Recent advancements in vision-language modeling introduce image cropping techniques that feed all encoded sub-images into the model.<n>We propose a lightweight, universal framework that seamlessly integrates with existing VLMs to enhance their ability to process finegrained details.
arXiv Detail & Related papers (2025-03-14T18:33:31Z)
Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems. These models have been shown to be highly capable, but also lacking some basic visual understanding skills. This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation [16.033361754660316]
Notice is the first Noise-free Text-Image Corruption and Evaluation pipeline for interpretability in Vision-Language Models (VLMs)<n>Our experiments on the SVO-Probes, MIT-States, and Facial Expression Recognition datasets reveal crucial insights into VLM decision-making.<n>This work paves the way for more transparent and interpretable multimodal systems.
arXiv Detail & Related papers (2024-06-24T05:13:19Z)
Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation [34.37450315995176]
Current Referring Video Object (RVOS) methods typically use vision and language models pretrained independently as backbones. We propose a temporal-aware prompt-tuning method, which adapts pretrained representations for pixel-level prediction. Our method performs favorably against state-of-the-art algorithms and exhibits strong generalization abilities.
arXiv Detail & Related papers (2024-05-17T08:14:22Z)
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL) Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning. We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z)
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest. This technique allows LVLMs to access more detailed visual information without altering the original image resolution. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning. IPVR contains three stages, see, think and confirm. We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.