SIFThinker: Spatially-Aware Image Focus for Visual Reasoning
- URL: http://arxiv.org/abs/2508.06259v4
- Date: Tue, 16 Sep 2025 09:40:13 GMT
- Title: SIFThinker: Spatially-Aware Image Focus for Visual Reasoning
- Authors: Zhangquan Chen, Ruihui Zhao, Chuwei Luo, Mingze Sun, Xinlei Yu, Yangyang Kang, Ruqi Huang
- Abstract summary: We introduce SIFThinker, a spatially-aware "think-with-images" framework that mimics human visual perception. SIFThinker enables attention correction and image-region focusing by interleaving depth-enhanced bounding boxes with natural language. In experiments, SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception.
- Score: 22.922568123298934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning; however, they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware "think-with-images" framework that mimics human visual perception. Specifically, SIFThinker enables attention correction and image-region focusing by interleaving depth-enhanced bounding boxes with natural language. Our contributions are twofold: First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Second, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception while maintaining strong general capabilities, highlighting the effectiveness of our method. Code: https://github.com/zhangquanchen/SIFThinker.
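The abstract describes GRPO-SIF only at a high level. As a rough sketch of how a reinforcement reward might blend answer correctness with grounding quality, consider the following; the function names, weights, box format, and matching rule are assumptions for illustration, not the paper's implementation:

```python
# Hypothetical sketch of a grounding-aware reward for GRPO-style training.
# Nothing here is from the SIFThinker codebase; weights and formats are assumed.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounded_reward(pred_answer, gold_answer, pred_box, gold_box,
                    w_answer=0.7, w_ground=0.3):
    """Blend answer correctness with grounding quality (weights assumed)."""
    r_answer = 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    return w_answer * r_answer + w_ground * iou(pred_box, gold_box)

# Example: correct answer, partially overlapping box -> reward of about 0.84.
print(grounded_reward("cat", "Cat", (10, 10, 60, 60), (20, 20, 70, 70)))
```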
Related papers
- AdaFocus: Knowing When and Where to Look for Adaptive Visual Reasoning [17.455916323311683]
We propose AdaFocus, a training-free framework for adaptive visual reasoning. AdaFocus follows a two-stage pipeline: a confidence-based module decides when to crop, and a semantic-guided localization module determines where to crop. Experimentally, AdaFocus delivers substantial performance gains while achieving an approximately 4.0x inference speedup.
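A minimal sketch of that two-stage decision (crop only when confidence is low, then localize semantically) is shown below; the interfaces and threshold are assumptions, not AdaFocus's actual API:

```python
# Illustrative sketch of a confidence-gated crop pipeline; all stubs assumed.
from typing import Callable

def adaptive_focus(image, question,
                   answer_fn: Callable,    # (image, question) -> (answer, confidence)
                   localize_fn: Callable,  # (image, question) -> (x1, y1, x2, y2)
                   crop_fn: Callable,      # (image, box) -> cropped image
                   conf_threshold: float = 0.8):
    """Answer on the full image; crop and retry only when confidence is low."""
    answer, conf = answer_fn(image, question)
    if conf >= conf_threshold:
        return answer                     # "when to crop": skip cropping entirely
    box = localize_fn(image, question)    # "where to crop": semantic localization
    refined, _ = answer_fn(crop_fn(image, box), question)
    return refined
```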
arXiv Detail & Related papers (2026-02-26T15:41:26Z)
- ConFoThinking: Consolidated Focused Attention Driven Thinking for Visual Question Answering [10.689628202869635]
ConFoThinking learns to aggregate attention into a designated intermediate layer, from which it mines and zooms in on salient regions for downstream visual understanding. Experiments across five VQA benchmarks demonstrate that ConFoThinking significantly improves perception performance.
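A minimal sketch of the "mine and zoom" step might look like the following, converting a patch-level attention map into a pixel-space crop box; the patch size and thresholding rule are assumptions, not ConFoThinking's code:

```python
# Toy attention-mining sketch: box around the most-attended patches.
import numpy as np

def salient_box_from_attention(attn: np.ndarray, patch: int = 14,
                               keep_ratio: float = 0.5):
    """attn: (H_p, W_p) patch-level attention map; returns a pixel-space box
    covering all patches whose attention exceeds keep_ratio * max."""
    ys, xs = np.nonzero(attn >= keep_ratio * attn.max())
    x1, x2 = xs.min() * patch, (xs.max() + 1) * patch
    y1, y2 = ys.min() * patch, (ys.max() + 1) * patch
    return int(x1), int(y1), int(x2), int(y2)

attn = np.zeros((16, 16))
attn[4:7, 9:12] = 1.0                     # a hot spot of salient patches
print(salient_box_from_attention(attn))   # (126, 56, 168, 98)
```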
arXiv Detail & Related papers (2026-02-26T06:28:43Z)
- Chatting with Images for Introspective Visual Thinking [50.7747647794877]
"Chatting with images" is a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions. ViLaVT achieves strong and consistent improvements on complex multi-image and video-based spatial reasoning tasks.
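Language-guided feature modulation can be illustrated with a FiLM-style sketch, where a prompt embedding produces a per-channel scale and shift applied to region features; the dimensions and linear maps below are assumptions, not ViLaVT's architecture:

```python
# FiLM-style modulation sketch: text conditions a scale/shift on region features.
import numpy as np

rng = np.random.default_rng(0)
d_text, d_region, n_regions = 64, 32, 5

text_emb = rng.normal(size=d_text)                  # pooled prompt embedding
regions = rng.normal(size=(n_regions, d_region))    # per-region visual features

# Prompt-conditioned per-channel scale (gamma) and shift (beta).
W_gamma = rng.normal(size=(d_text, d_region)) * 0.1
W_beta = rng.normal(size=(d_text, d_region)) * 0.1
gamma = 1.0 + text_emb @ W_gamma                    # (d_region,)
beta = text_emb @ W_beta                            # (d_region,)

modulated = gamma * regions + beta                  # broadcast over all regions
print(modulated.shape)                              # (5, 32)
```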
arXiv Detail & Related papers (2026-02-11T17:42:37Z)
- Toward Cognitive Supersensing in Multimodal Large Language Model [67.15559571626747]
We introduce Cognitive Supersensing, a training paradigm that endows MLLMs with human-like visual imagery capabilities. In experiments, MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench. We will open-source CogSense-Bench and our model weights.
arXiv Detail & Related papers (2026-02-02T02:19:50Z)
- Monet: Reasoning in Latent Visual Space Beyond Images and Language [55.424507246294326]
"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning.<n>Existing methods fall short of human-like abstract visual thinking.<n>We introduce Monet, a training framework that enables multimodal large language models to reason directly within the latent visual space.
arXiv Detail & Related papers (2025-11-26T13:46:39Z)
- Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
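A toy illustration of drawing-based reasoning with Pillow is shown below: elementary operations (here, a bounding box and an auxiliary line) are rendered onto the image, which can then be re-inspected in later reasoning steps. VILASR's actual operation set may differ:

```python
# Toy "drawing to reason" sketch using Pillow; the operation set is assumed.
from PIL import Image, ImageDraw

img = Image.new("RGB", (320, 240), "white")   # stand-in for a real input image
draw = ImageDraw.Draw(img)
draw.rectangle([60, 40, 180, 160], outline="red", width=3)   # mark an object
draw.line([(120, 100), (280, 220)], fill="blue", width=2)    # auxiliary line
img.save("annotated.png")  # the annotated image is fed back for further reasoning
```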
arXiv Detail & Related papers (2025-06-11T17:41:50Z)
- Align and Surpass Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning [18.13538667261998]
Current multi-modal models exhibit a notable misalignment with the human visual system when identifying objects that are visually assimilated into the background. We build a visual system that mimics human camouflaged perception to progressively and iteratively refocus on visually concealed content.
arXiv Detail & Related papers (2025-05-26T07:27:18Z)
- VisRL: Intention-Driven Visual Perception via Reinforced Reasoning [22.907814548315468]
We propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception. By treating intermediate focus selection as an internal decision optimized through trial and error, our method eliminates the need for costly region annotations. Our method consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs.
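The trial-and-error idea can be sketched as follows: sample candidate focus regions, score each solely by whether the downstream answer is correct, and reinforce accordingly. Every interface here is a placeholder assumption, not VisRL's code:

```python
# Conceptual sketch of annotation-free focus learning: the only training
# signal is final-answer correctness, so no region labels are needed.

def focus_train_step(image, question, gold_answer,
                     propose_boxes,    # stochastic policy: (image, question, n) -> boxes
                     answer_on_crop,   # (image, box, question) -> predicted answer
                     update_policy,    # reinforce boxes in proportion to reward
                     n_samples: int = 4):
    """Sample candidate focus regions; reward those that yield correct answers."""
    boxes = propose_boxes(image, question, n_samples)
    rewards = [1.0 if answer_on_crop(image, b, question) == gold_answer else 0.0
               for b in boxes]
    update_policy(boxes, rewards)   # e.g., a REINFORCE- or GRPO-style update
    return max(rewards)
```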
arXiv Detail & Related papers (2025-03-10T16:49:35Z)
- Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [52.478956204238315]
We study the spatial reasoning challenge through the lens of mechanistic interpretability. We observe that successful spatial reasoning correlates strongly with the model's ability to align its attention with actual object locations. Motivated by these findings, we propose ADAPTVIS to sharpen the attention on highly relevant regions when the model is confident.
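Confidence-gated attention sharpening can be illustrated with temperature scaling: sharpen the attention distribution when the model is confident, smooth it otherwise. The temperatures and threshold below are assumptions, not ADAPTVIS's published values:

```python
# Temperature-based attention sharpening/smoothing; constants are assumed.
import numpy as np

def rescale_attention(logits: np.ndarray, confidence: float,
                      t_sharp: float = 0.5, t_smooth: float = 2.0,
                      conf_threshold: float = 0.6) -> np.ndarray:
    """Temperature-scale attention logits based on the model's confidence."""
    temp = t_sharp if confidence >= conf_threshold else t_smooth
    scaled = logits / temp
    exp = np.exp(scaled - scaled.max())     # numerically stable softmax
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])
print(rescale_attention(logits, confidence=0.9))  # sharper: mass on top region
print(rescale_attention(logits, confidence=0.3))  # flatter: more exploratory
```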
arXiv Detail & Related papers (2025-03-03T17:57:03Z)
- Improving Vision-and-Language Reasoning via Spatial Relations Modeling [30.477235227733928]
Visual commonsense reasoning (VCR) is a challenging multi-modal task.
The proposed method can guide the representations to maintain more spatial context.
We achieve state-of-the-art results on VCR and on two other vision-and-language reasoning tasks, VQA and NLVR.
arXiv Detail & Related papers (2023-11-09T11:54:55Z)
- Learned Image Reasoning Prior Penetrates Deep Unfolding Network for Panchromatic and Multi-Spectral Image Fusion [45.28120834593148]
We propose a novel model-driven deep unfolding framework with image reasoning prior tailored for the pan-sharpening task.
Our framework is motivated by the content reasoning ability of masked autoencoders, with designs built around this insight.
The uniqueness of our framework is that the holistic learning process is explicitly integrated with the inherent physical mechanism underlying the pan-sharpening task.
arXiv Detail & Related papers (2023-08-30T15:15:31Z)
- A domain adaptive deep learning solution for scanpath prediction of paintings [66.46953851227454]
This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings.
We introduce a new approach to predicting human visual attention, a process that shapes several human cognitive functions.
The proposed new architecture ingests images and returns scanpaths, a sequence of points featuring a high likelihood of catching viewers' attention.
arXiv Detail & Related papers (2022-09-22T22:27:08Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
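The contrast between global and fine-grained alignment can be sketched as follows: a single pooled image-text similarity versus per-token best-patch similarities. LOUPE's game-theoretic (Shapley-style) attribution is more involved; all shapes here are assumptions:

```python
# Global vs. fine-grained alignment scores on random embeddings (shapes assumed).
import numpy as np

rng = np.random.default_rng(1)
patches = rng.normal(size=(49, 16))   # image patch embeddings (7x7 grid)
tokens = rng.normal(size=(8, 16))     # text token embeddings

# Global alignment: one similarity between pooled image and text vectors.
global_sim = float(patches.mean(axis=0) @ tokens.mean(axis=0))

# Fine-grained alignment: each token is matched to its best patch.
sim = tokens @ patches.T              # (8, 49) token-patch similarities
fine_sim = float(sim.max(axis=1).mean())

print(global_sim, fine_sim)
```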
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress on the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)