RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought
- URL: http://arxiv.org/abs/2506.04277v1
- Date: Wed, 04 Jun 2025 02:07:40 GMT
- Title: RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought
- Authors: Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, Wenbo Zhu
- Abstract summary: RSVP is a framework that unifies multi-step multimodal reasoning with grounded visual understanding. RSVP exploits MLLMs' inherent localization capabilities, enabling the models to not only reason about objects but also generate structured visual representations. Our experiments demonstrate that RSVP achieves state-of-the-art performance, surpassing prior methods by up to +6.5 gIoU on ReasonSeg and achieving 49.7 mAP on SegInW under zero-shot settings.
- Score: 6.037123011622866
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated remarkable reasoning capability but lack explicit mechanisms for visual grounding and segmentation, creating a gap between cognitive reasoning and visual perception. To bridge this gap, we introduce Reasoning Segmentation via Visual Prompting (RSVP), a novel framework that unifies multi-step multimodal reasoning with grounded visual understanding. RSVP is a structured two-stage framework that integrates reasoning-driven localization with segmentation refinement. In the reasoning stage, RSVP employs multimodal chain-of-thought visual prompts to help MLLMs understand queries and infer targets, generating interpretable region proposals that enhance visual grounding. In the segmentation stage, RSVP refines these proposals with a Vision-Language Segmentation Module (VLSM), which seamlessly integrates textual and visual cues to produce precise segmentation masks. By explicitly modelling the interaction between multimodal reasoning and segmentation, RSVP introduces a new paradigm for interpretable reasoning segmentation. It exploits MLLMs' inherent localization capabilities, enabling the models to not only reason about objects but also generate structured visual representations. Our extensive experiments demonstrate that RSVP achieves state-of-the-art performance, surpassing prior methods by up to +6.5 gIoU and +9.2 cIoU on ReasonSeg, and achieving 49.7 mAP on SegInW under zero-shot settings. These results validate RSVP as an effective and scalable framework for integrating cognitive reasoning with structured visual understanding.
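The abstract describes a two-stage architecture: an MLLM first performs chain-of-thought reasoning with visual prompts to produce interpretable region proposals, and a Vision-Language Segmentation Module (VLSM) then refines those proposals into masks. The Python sketch below illustrates one plausible way to wire such a pipeline together; the RegionProposal structure, the mllm/vlsm callables, and the prompt text are illustrative assumptions, not the paper's actual interfaces.

```python
# Hypothetical sketch of RSVP's two-stage pipeline as described in the abstract.
# The MLLM and VLSM interfaces below are illustrative placeholders, not the
# authors' actual implementation.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) region proposal in pixels

@dataclass
class RegionProposal:
    box: Box          # coarse localization produced during the reasoning stage
    rationale: str    # chain-of-thought explanation of why this region matches the query

# Assumed chain-of-thought prompt; the paper's actual prompt is not given here.
COT_PROMPT = (
    "Think step by step about the query, describe the target object, "
    "then output its bounding box as visual-prompt coordinates."
)

def rsvp_segment(
    image,                                                # e.g. a PIL.Image or ndarray
    query: str,
    mllm: Callable[[object, str], List[RegionProposal]],  # stage 1: reasoning + localization
    vlsm: Callable[[object, str, Box], object],           # stage 2: box + text -> mask
):
    """Two-stage reasoning segmentation: reason, propose regions, refine to masks."""
    # Stage 1: multimodal chain-of-thought reasoning yields interpretable region proposals.
    proposals = mllm(image, f"{COT_PROMPT}\nQuery: {query}")
    # Stage 2: the segmentation module refines each proposal into a mask,
    # conditioning on both the textual rationale and the visual prompt (the box).
    masks = [vlsm(image, p.rationale, p.box) for p in proposals]
    return proposals, masks

# Minimal usage with stub models, just to show the data flow.
if __name__ == "__main__":
    fake_mllm = lambda img, prompt: [RegionProposal((10, 20, 120, 200), "the red mug on the left")]
    fake_vlsm = lambda img, text, box: {"mask_for": box, "prompt": text}
    print(rsvp_segment(image=None, query="the object used for drinking coffee",
                       mllm=fake_mllm, vlsm=fake_vlsm))
```

Injecting the two models as callables keeps the reasoning and segmentation stages decoupled, mirroring the abstract's separation of reasoning-driven localization from segmentation refinement.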
Related papers
- Remote Sensing Large Vision-Language Model: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling [42.46176089721314]
Large Vision and Language Models (LVLMs) have shown strong performance across various vision-language tasks in natural image domains. Their application to remote sensing (RS) remains underexplored due to significant domain differences in visual appearances, object scales, and semantics. We propose a novel LVLM framework tailored for RS understanding, incorporating two core components: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling.
arXiv Detail & Related papers (2025-06-27T02:31:37Z)
- Decoupling the Image Perception and Multimodal Reasoning for Reasoning Segmentation with Digital Twin Representations [48.98219448782818]
Reasoning Segmentation (RS) is a multimodal vision-text task that requires segmenting objects based on implicit text queries. Current RS approaches rely on fine-tuning vision-language models (VLMs) for both perception and reasoning. We introduce DTwinSeger, a novel RS approach that leverages Digital Twin representation as an intermediate layer to decouple perception from reasoning.
arXiv Detail & Related papers (2025-06-09T17:05:02Z)
- Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification [22.871255950998016]
We introduce a novel framework for inference-time visual tokens scaling that enables MLLMs to perform verifier-guided reasoning over visual content. Our method significantly outperforms existing approaches across diverse visual reasoning benchmarks. These results demonstrate the promise of dynamic inference mechanisms for enabling fine-grained, context-aware visual reasoning in next-generation MLLMs.
arXiv Detail & Related papers (2025-06-08T17:38:49Z)
- Progressive Language-guided Visual Learning for Multi-Task Visual Grounding [21.297317604403652]
In this paper, we propose a Progressive Language-guided Visual Learning framework for multi-task visual grounding.
arXiv Detail & Related papers (2025-04-22T12:48:12Z)
- Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation [22.057386630831402]
Large Vision-Language Models can be instructed to solve diverse tasks by prompting, without task-specific training. We evaluate the segmentation performance of several recent models guided by either text or visual prompts. We propose PromptMatcher, a training-free baseline that combines both text and visual prompts.
arXiv Detail & Related papers (2025-03-25T13:36:59Z)
- VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making [21.61801132083334]
VIPER is a novel framework for multimodal instruction-based planning. It integrates VLM-based perception with LLM-based reasoning. We show that VIPER significantly outperforms state-of-the-art visual instruction-based planners.
arXiv Detail & Related papers (2025-03-19T11:05:42Z)
- Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts [64.93416171745693]
ThinkFirst is a training-free reasoning segmentation framework. Our approach allows GPT-4o or other powerful MLLMs to generate a detailed, chain-of-thought description of an image. This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process.
arXiv Detail & Related papers (2025-03-10T16:26:11Z)
- EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models [80.00303150568696]
We propose a novel Multimodal Large Language Model (MLLM) that empowers comprehension of arbitrary referring visual prompts with less training effort than existing approaches.
Our approach embeds referring visual prompts as spatial concepts conveying specific spatial areas comprehensible to the MLLM.
We also propose a Geometry-Agnostic Learning paradigm (GAL) to further disentangle the MLLM's region-level comprehension from the specific formats of referring visual prompts.
arXiv Detail & Related papers (2024-09-25T08:22:00Z)
- CoReS: Orchestrating the Dance of Reasoning and Segmentation [17.767049542947497]
We believe that the act of reasoning segmentation should mirror the cognitive stages of human visual search.
We introduce the Chains of Reasoning and Segmenting (CoReS) and find this top-down visual hierarchy indeed enhances the visual search process.
Experiments demonstrate the superior performance of our CoReS, which surpasses the state-of-the-art method by 6.5% on the ReasonSeg dataset.
arXiv Detail & Related papers (2024-04-08T16:55:39Z)
- Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning [40.972648044298374]
Multi-Modal Large Language Models (MLLMs) have demonstrated impressive performance in various VQA tasks.
They often lack interpretability and struggle with complex visual inputs.
We introduce the large-scale Visual CoT dataset comprising 438k question-answer pairs.
We propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable thoughts.
arXiv Detail & Related papers (2024-03-25T17:59:23Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages: see, think, and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)