DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models
- URL: http://arxiv.org/abs/2603.03857v1
- Date: Wed, 04 Mar 2026 09:06:47 GMT
- Title: DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models
- Authors: Yangfu Li, Hongjian Zhan, Jiawei Chen, Yuning Gong, Qi Liu, Yue Lu
- Abstract summary: We propose DeepScan, a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning. Experimental results demonstrate that DeepScan significantly boosts LVLMs in diverse visual tasks. It achieves 90.6% overall accuracy on V* when integrated with Qwen2.5-VL-7B.
- Score: 17.001413023262675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans can robustly localize visual evidence and provide grounded answers even in noisy environments by identifying critical cues and then relating them to the full context in a bottom-up manner. Inspired by this, we propose DeepScan, a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning for visually grounded reasoning in Large Vision-Language Models (LVLMs). Unlike existing methods that pursue one-shot localization of complete evidence, Hierarchical Scanning performs local cue exploration and multi-scale evidence extraction to recover evidence in a bottom-up manner, effectively mitigating the impact of distracting context. Refocusing then optimizes the localized evidence view through collaboration of LVLMs and visual experts. Finally, Evidence-Enhanced Reasoning aggregates multi-granular views via a hybrid evidence memory and yields accurate and interpretable answers. Experimental results demonstrate that DeepScan significantly boosts LVLMs in diverse visual tasks, especially in fine-grained visual understanding. It achieves 90.6% overall accuracy on V* when integrated with Qwen2.5-VL-7B. Moreover, DeepScan provides consistent improvements for LVLMs across various architectures and model scales without additional adaptation cost.
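Read as a pipeline, the three stages are straightforward to prototype. The following is a minimal sketch under several assumptions: `propose_cue` stands in for an LVLM call that returns a box around a local cue, `refine_box` for a visual expert (e.g., a detector) that tightens a box, and `answer` for the LVLM's multi-image answering call; these names, the scale schedule, and the plain view list are illustrative and are not taken from the paper's implementation.

```python
# Minimal sketch of a DeepScan-style pipeline (names and signatures are
# illustrative assumptions, not the authors' released code).
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

from PIL import Image

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), PIL crop order


@dataclass
class Evidence:
    box: Box
    view: Image.Image
    scale: float  # size of the view relative to the full image


def hierarchical_scan(
    image: Image.Image,
    question: str,
    propose_cue: Callable[[Image.Image, str], Box],
    scales: Sequence[float] = (0.25, 0.5, 1.0),
) -> List[Evidence]:
    """Bottom-up evidence recovery: start from a local cue proposed by the
    LVLM, then expand it into progressively larger (multi-scale) views."""
    w, h = image.size
    cue = propose_cue(image, question)  # small region containing a critical cue
    cx, cy = (cue[0] + cue[2]) / 2, (cue[1] + cue[3]) / 2
    evidences: List[Evidence] = []
    for s in scales:
        half_w, half_h = w * s / 2, h * s / 2
        box = (int(max(0, cx - half_w)), int(max(0, cy - half_h)),
               int(min(w, cx + half_w)), int(min(h, cy + half_h)))
        evidences.append(Evidence(box=box, view=image.crop(box), scale=s))
    return evidences


def refocus(
    image: Image.Image,
    evidences: List[Evidence],
    refine_box: Callable[[Image.Image, Box], Box],
) -> List[Evidence]:
    """Tighten each evidence view with an external visual expert
    (e.g., an open-vocabulary detector) and re-crop."""
    refined: List[Evidence] = []
    for ev in evidences:
        box = refine_box(image, ev.box)
        refined.append(Evidence(box=box, view=image.crop(box), scale=ev.scale))
    return refined


def evidence_enhanced_answer(
    image: Image.Image,
    question: str,
    evidences: List[Evidence],
    answer: Callable[[List[Image.Image], str], str],
) -> str:
    """Aggregate the full image plus multi-granular evidence views before
    answering (a crude stand-in for the paper's hybrid evidence memory)."""
    views = [image] + [ev.view for ev in sorted(evidences, key=lambda e: e.scale)]
    return answer(views, question)
```

In practice `propose_cue` and `answer` would be served by the same LVLM (e.g., Qwen2.5-VL-7B), while the paper's hybrid evidence memory is richer than the simple view list used in this sketch.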
Related papers
- VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation [64.82775032985485]
Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. We propose EVisRAG, an end-to-end framework that learns evidence-guided multi-image reasoning to address this issue.
arXiv Detail & Related papers (2025-10-10T13:34:23Z)
- SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding [5.976839106353883]
SECOND (Selective and Contrastive Decoding) is a novel approach that enables Vision-Language Models to leverage multi-scale visual information in an object-centric manner. SECOND significantly reduces perceptual hallucinations and improves performance across a wide range of benchmarks (a generic contrastive-decoding sketch is given after this list).
arXiv Detail & Related papers (2025-06-10T02:55:38Z)
- mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation [15.991125806837386]
Large Vision-Language Models (LVLMs) have made remarkable strides in multimodal tasks such as visual question answering, visual grounding, and complex reasoning. Retrieval-Augmented Generation (RAG) offers a practical solution to mitigate these challenges by allowing LVLMs to access large-scale knowledge databases via retrieval mechanisms.
arXiv Detail & Related papers (2025-05-29T23:32:03Z)
- DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding [61.26026947423187]
Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features. Current Multimodal Large Language Models (MLLMs) struggle to integrate reasoning into visual perception. We propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities.
arXiv Detail & Related papers (2025-03-17T04:06:34Z)
- Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multi-layer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance. We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z)
- Multimodal Misinformation Detection using Large Vision-Language Models [7.505532091249881]
Large language models (LLMs) have shown remarkable performance in various tasks.
Few approaches consider evidence retrieval as part of misinformation detection.
We propose a novel re-ranking approach for multimodal evidence retrieval.
arXiv Detail & Related papers (2024-07-19T13:57:11Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Good Questions Help Zero-Shot Image Reasoning [110.1671684828904]
Question-Driven Visual Exploration (QVix) is a novel prompting strategy that enhances the exploratory capabilities of large vision-language models (LVLMs).
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods.
arXiv Detail & Related papers (2023-12-04T03:18:51Z)
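Several of the decoding-side methods above, SECOND among them, build on visual contrastive decoding: next-token logits computed with the original image are contrasted against logits computed with a degraded image, so that tokens the model would emit regardless of the visual evidence are suppressed. The sketch below shows only this generic step; the function name, the degradation choice, and the plausibility cutoff are assumptions, and SECOND's selective, object-centric, multi-scale procedure is not reproduced here.

```python
# Generic visual contrastive-decoding step (illustrative; not SECOND's exact
# procedure). Logits are assumed to come from two forward passes of the same
# VLM: one on the original image and one on a degraded image (e.g., blurred).
import numpy as np


def contrastive_logits(
    logits_full: np.ndarray,      # next-token logits with the original image
    logits_degraded: np.ndarray,  # next-token logits with the degraded image
    alpha: float = 1.0,           # contrast strength
    beta: float = 0.1,            # plausibility cutoff relative to the best token
) -> np.ndarray:
    """Amplify tokens supported by real visual evidence and suppress tokens
    the model would emit anyway (likely hallucinations)."""
    contrasted = (1.0 + alpha) * logits_full - alpha * logits_degraded

    # Adaptive plausibility constraint: only keep tokens whose original
    # probability is within a factor `beta` of the most likely token.
    probs_full = np.exp(logits_full - logits_full.max())
    probs_full /= probs_full.sum()
    keep = probs_full >= beta * probs_full.max()
    return np.where(keep, contrasted, -np.inf)
```

A full decoder would apply this adjustment at every generation step inside the LVLM's sampling loop.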