Related papers: CoReS: Orchestrating the Dance of Reasoning and Segmentation

CoReS: Orchestrating the Dance of Reasoning and Segmentation

URL: http://arxiv.org/abs/2404.05673v3
Date: Thu, 11 Jul 2024 03:25:06 GMT
Title: CoReS: Orchestrating the Dance of Reasoning and Segmentation
Authors: Xiaoyi Bao, Siyang Sun, Shuailei Ma, Kecheng Zheng, Yuxin Guo, Guosheng Zhao, Yun Zheng, Xingang Wang,
Abstract summary: We believe that the act of reasoning segmentation should mirror the cognitive stages of human visual search. We introduce the Chains of Reasoning and Segmenting (CoReS) and find this top-down visual hierarchy indeed enhances the visual search process. Experiments demonstrate the superior performance of our CoReS, which surpasses the state-of-the-art method by 6.5% on the ReasonSeg dataset.
Score: 17.767049542947497
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The reasoning segmentation task, which demands a nuanced comprehension of intricate queries to accurately pinpoint object regions, is attracting increasing attention. However, Multi-modal Large Language Models (MLLM) often find it difficult to accurately localize the objects described in complex reasoning contexts. We believe that the act of reasoning segmentation should mirror the cognitive stages of human visual search, where each step is a progressive refinement of thought toward the final object. Thus we introduce the Chains of Reasoning and Segmenting (CoReS) and find this top-down visual hierarchy indeed enhances the visual search process. Specifically, we propose a dual-chain structure that generates multi-modal, chain-like outputs to aid the segmentation process. Furthermore, to steer the MLLM's outputs into this intended hierarchy, we incorporate in-context inputs as guidance. Extensive experiments demonstrate the superior performance of our CoReS, which surpasses the state-of-the-art method by 6.5\% on the ReasonSeg dataset. Project: https://chain-of-reasoning-and-segmentation.github.io/.

Related papers

Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation [61.37076111486196]
Ref-AVS aims to segment target objects in audible videos based on given reference expressions.<n>We propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process.<n>Ref-Thinker is a multimodal language model capable of reasoning over textual, visual, and auditory cues.
arXiv Detail & Related papers (2025-08-06T13:05:09Z)
Multimodal Referring Segmentation: A Survey [93.24051010753817]
Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format.<n>Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models.
arXiv Detail & Related papers (2025-08-01T02:14:00Z)
LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance [56.474856189865946]
Large multi-modal models (LMMs) struggle with inaccurate segmentation and hallucinated comprehension.<n>We propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation.<n>LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks.
arXiv Detail & Related papers (2025-07-08T07:46:26Z)
RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought [6.037123011622866]
RSVP is a framework that unifies multi-step multimodal reasoning with grounded visual understanding.<n> RSVP exploits MLLMs' inherent localization capabilities, enabling the models to not only reason about objects but also generate structured visual representations.<n>Our experiments demonstrate RSVP state-of-the-art performance, surpasses state-of-the-art methods by up to +6.5 gIoU on ReasonSeg, and achieves 49.7 mAP on SegInW under zero-shot settings.
arXiv Detail & Related papers (2025-06-04T02:07:40Z)
MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation [14.144097766150397]
We present a dataset called Multi-target and Multi-granularity Reasoning (MMR) MMR comprises 194K complex and implicit instructions that consider multi-target, object-level, and part-level aspects. We propose a straightforward yet effective framework for multi-target, object-level, and part-level reasoning segmentation.
arXiv Detail & Related papers (2025-03-18T04:23:09Z)
Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z)
Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts [64.93416171745693]
ThinkFirst is a training-free reasoning segmentation framework. Our approach allows GPT-4o or other powerful MLLMs to generate a detailed, chain-of-thought description of an image. This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process.
arXiv Detail & Related papers (2025-03-10T16:26:11Z)
Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies [69.28082193942991]
This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills. utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches. To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR)
arXiv Detail & Related papers (2024-06-16T12:58:31Z)
Frequency-based Matcher for Long-tailed Semantic Segmentation [22.199174076366003]
We focus on a relatively under-explored task setting, long-tailed semantic segmentation (LTSS) We propose a dual-metric evaluation system and construct the LTSS benchmark to demonstrate the performance of semantic segmentation methods and long-tailed solutions. We also propose a transformer-based algorithm to improve LTSS, frequency-based matcher, which solves the oversuppression problem by one-to-many matching.
arXiv Detail & Related papers (2024-06-06T09:57:56Z)
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z)
LaSagnA: Language-based Segmentation Assistant for Complex Queries [39.620806493454616]
Large Language Models for Vision (vLLMs) generate detailed perceptual outcomes, including bounding boxes and masks. In this study, we acknowledge that the main cause of these problems is the insufficient complexity of training queries. We present three novel strategies to effectively handle the challenges arising from the direct integration of the proposed format.
arXiv Detail & Related papers (2024-04-12T14:40:45Z)
DeiSAM: Segment Anything with Deictic Prompting [27.960890657540443]
DeiSAM is a combination of large pre-trained neural networks with differentiable logic reasoners. It segments objects by matching them to logically inferred image regions. Our empirical results demonstrate that DeiSAM is a substantial improvement over purely data-driven baselines.
arXiv Detail & Related papers (2024-02-21T20:43:49Z)
Coarse-to-Fine Amodal Segmentation with Shape Prior [52.38348188589834]
Amodal object segmentation is a challenging task that involves segmenting both visible and occluded parts of an object. We propose a novel approach called Coarse-to-Fine: C2F-Seg, that addresses this problem by progressively modeling the amodal segmentation.
arXiv Detail & Related papers (2023-08-31T15:56:29Z)
LISA: Reasoning Segmentation via Large Language Model [68.24075852136761]
We propose a new segmentation task -- reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. We present LISA: large Language Instructed Assistant, which inherits the language generation capabilities of multimodal Large Language Models.
arXiv Detail & Related papers (2023-08-01T17:50:17Z)
Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segments an image from a language expression. We develop an algorithm that shifts from being localization-centric to segmentation-language. Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z)
Referring Transformer: A One-step Approach to Multi-task Visual Grounding [45.42959940733406]
We propose a simple one-stage multi-task framework for visual grounding tasks. Specifically, we leverage a transformer architecture, where two modalities are fused in a visual-lingual encoder. We show that our model benefits greatly from contextualized information and multi-task training.
arXiv Detail & Related papers (2021-06-06T10:53:39Z)
Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred by a natural language expression. Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask. We present a "Then-Then-Segment" scheme to tackle these problems. Our framework is simple but surprisingly effective.
arXiv Detail & Related papers (2021-03-30T12:25:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.