VGR: Visual Grounded Reasoning
- URL: http://arxiv.org/abs/2506.11991v2
- Date: Mon, 16 Jun 2025 07:35:52 GMT
- Title: VGR: Visual Grounded Reasoning
- Authors: Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, Jun Xiao
- Abstract summary: This paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer questions or reason solely in the language space, VGR first detects relevant regions that may help solve the problem, and then provides precise answers based on replayed image regions.
- Score: 24.19194463566865
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning in the pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer questions or reason solely in the language space, VGR first detects relevant regions that may help solve the problem, and then provides precise answers based on replayed image regions. To achieve this, we construct a large-scale SFT dataset called VGR-SFT that contains reasoning data with mixed vision grounding and language deduction. The inference pipeline of VGR allows the model to choose bounding boxes for visual reference, and a replay stage is introduced to integrate the corresponding regions into the reasoning process, enhancing multimodal comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multi-modal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30% of the image token count while delivering improvements of +4.1 on MMStar, +7.1 on AI2D, and +12.9 on ChartQA.
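Below is a minimal sketch of how such a detect-then-replay inference loop could be wired up. The model interface (`encode_image`, `encode_text`, `encode_region`, `generate_step`) and the `<bbox>`/`<answer>` token conventions are assumptions made for illustration; the actual VGR implementation built on LLaVA-NeXT-7B may differ substantially.

```python
# Illustrative sketch of a VGR-style "detect region -> replay -> answer" loop.
# The model interface and the <bbox>/<answer> token names below are
# hypothetical; they are not VGR's actual API.
import re
from PIL import Image

BBOX_PATTERN = re.compile(
    r"<bbox>\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*</bbox>"
)

def reason_with_region_replay(model, image: Image.Image, question: str,
                              max_steps: int = 8) -> str:
    """Generate a reasoning chain; whenever the model emits a bounding box,
    crop that region, re-encode it, and append it to the context (the replay)."""
    context = model.encode_image(image) + model.encode_text(question)
    reasoning = ""
    for _ in range(max_steps):
        chunk = model.generate_step(context)      # next span of reasoning text
        reasoning += chunk
        context += model.encode_text(chunk)
        match = BBOX_PATTERN.search(chunk)
        if match:
            # Normalized [x1, y1, x2, y2] coordinates predicted by the model.
            x1, y1, x2, y2 = map(float, match.groups())
            w, h = image.size
            region = image.crop((int(x1 * w), int(y1 * h),
                                 int(x2 * w), int(y2 * h)))
            # Replay: feed the cropped region back as additional visual tokens.
            context += model.encode_region(region)
        if "<answer>" in chunk:                   # model signals a final answer
            break
    return reasoning
```

The intent of the replay step is that fine-grained detail is pulled in on demand, when the model's own reasoning asks for it, rather than encoded up front; this is consistent with the abstract's claim that VGR reasons over image details while using only 30% of the baseline's image token count.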
Related papers
- MGCR-Net: Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection [55.702662643521265]
We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data. Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods.
arXiv Detail & Related papers (2025-08-03T02:50:08Z) - MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs [6.165053219836395]
We propose MMGraphRAG, which refines visual content through scene graphs and constructs a multimodal knowledge graph. It employs spectral clustering to achieve cross-modal entity linking and retrieves context along reasoning paths to guide the generative process. Experimental results show that MMGraphRAG achieves state-of-the-art performance on the DocBench and MMLongBench datasets.
arXiv Detail & Related papers (2025-07-28T13:16:23Z) - MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization [19.70803794316208]
Medical Image Grounding (MIG) involves localizing specific regions in medical images based on textual descriptions. Existing Vision-Language Models (VLMs) for MIG often rely on Supervised Fine-Tuning (SFT) with large amounts of Chain-of-Thought (CoT) reasoning annotations. We propose the Spatial-Semantic Rewarded Group Relative Policy Optimization to train the model without CoT reasoning annotations.
arXiv Detail & Related papers (2025-07-01T21:51:42Z) - Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z) - VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought [51.43082554363725]
We introduce VLM-R$^3$ (Visual Language Model with Region Recognition and Reasoning), a framework that equips an MLLM with the ability to decide when additional visual evidence is needed. Experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R$^3$ sets a new state of the art.
arXiv Detail & Related papers (2025-05-22T03:50:13Z) - GRIT: Teaching MLLMs to Think with Images [22.74533687444133]
Grounded Reasoning with Images and Texts (GRIT) is a novel method for training MLLMs to think with images. GRIT generates reasoning chains that interleave natural language and explicit bounding box coordinates. GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets.
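As a rough illustration of what such an interleaved chain could look like, the toy snippet below extracts box coordinates from a reasoning string; the inline `[x1, y1, x2, y2]` notation is assumed for the example and is not GRIT's actual output schema.

```python
# Toy parser for a reasoning chain that interleaves prose and bounding boxes.
# The inline "[x1, y1, x2, y2]" notation is assumed for illustration only.
import re

chain = ("The sign in the upper-left corner [12, 8, 140, 60] points toward "
         "the staircase [20, 70, 90, 130], so the exit is on the left.")

boxes = [tuple(map(int, m.groups()))
         for m in re.finditer(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]", chain)]
print(boxes)  # [(12, 8, 140, 60), (20, 70, 90, 130)]
```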
arXiv Detail & Related papers (2025-05-21T17:54:49Z) - Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries [30.692007887121278]
Retrieval-Augmented Generation (RAG) is a popular approach for enhancing Large Language Models (LLMs). Visual-RAG requires text-to-image retrieval and integration of relevant clue images to extract visual knowledge as evidence.
arXiv Detail & Related papers (2025-02-23T16:23:50Z) - LLV-FSR: Exploiting Large Language-Vision Prior for Face Super-resolution [67.23699927053191]
We propose a new framework called LLV-FSR, which marries the power of large vision-language models and higher-order visual priors with the challenging task of face super-resolution.
Experimental results demonstrate that our proposed framework significantly improves both the reconstruction quality and perceptual quality, surpassing the SOTA by 0.43dB in terms of PSNR on the MMCelebA-HQ dataset.
arXiv Detail & Related papers (2024-11-14T09:12:18Z) - Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training [79.27663870280038]
We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source vision-language models to respond to visual prompts.
When region annotations are provided, CRG increases absolute accuracy by up to 11.1% on ViP-Bench.
We also show CRG's applicability to spatial reasoning, with a 10% improvement on What'sUp.
arXiv Detail & Related papers (2024-03-04T18:55:30Z) - GROUNDHOG: Grounding Large Language Models to Holistic Segmentation [22.347590874621865]
We introduce GROUNDHOG, an MLLM developed by grounding Large Language Models to holistic segmentation.
GROUNDHOG incorporates a masked feature extractor and converts extracted features into visual entity tokens for the MLLM backbone.
Our experimental results show that GROUNDHOG achieves superior performance on various language grounding tasks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-02-26T18:59:33Z) - GLaMM: Pixel Grounding Large Multimodal Model [57.91763410032292]
We present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks.
GLaMM is flexible enough to accept both textual and optional visual prompts (region of interest) as input.
Our proposed GCG (Grounded Conversation Generation) task requires densely grounded concepts in natural scenes at a large scale.
arXiv Detail & Related papers (2023-11-06T18:59:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.