Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts
- URL: http://arxiv.org/abs/2503.07503v3
- Date: Tue, 25 Mar 2025 07:05:14 GMT
- Title: Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts
- Authors: Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang
- Abstract summary: ThinkFirst is a training-free reasoning segmentation framework. Our approach allows GPT-4o or other powerful MLLMs to generate a detailed, chain-of-thought description of an image. This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process.
- Score: 64.93416171745693
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning segmentation is a challenging vision-language task that aims to output a segmentation mask with respect to a complex, implicit, and even non-visual query text. Previous works incorporated multimodal Large Language Models (MLLMs) with segmentation models to approach this difficult problem. However, their segmentation quality often falls short in complex cases, particularly when dealing with out-of-domain objects with intricate structures, blurry boundaries, occlusions, or high similarity with surroundings. In this paper, we introduce ThinkFirst, a training-free reasoning segmentation framework that leverages GPT's chain of thought to address these challenging cases. Our approach allows GPT-4o or other powerful MLLMs to generate a detailed, chain-of-thought description of an image. This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process. Our framework allows users to easily interact with the segmentation agent using multimodal inputs, such as simple text and image scribbles, for successive refinement or communication. We evaluate the performance of ThinkFirst on diverse objects. Extensive experiments show that this zero-shot CoT approach significantly improves the vanilla reasoning segmentation agent, both qualitatively and quantitatively, while being less sensitive to user-supplied prompts after Thinking First.
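The abstract describes a two-stage, training-free pipeline but includes no code. The sketch below illustrates that flow under stated assumptions: the GPT-4o call uses the OpenAI Python SDK, while `segment_with_assistant` is a hypothetical placeholder for a LISA-style language-instructed segmentation assistant; prompt wording and function names are illustrative, not the authors' implementation.

```python
"""
Minimal sketch of a ThinkFirst-style pipeline as described in the abstract:
(1) ask a multimodal LLM for a detailed chain-of-thought description of the
image, (2) pass that description, together with the user's query, to a
language-instructed segmentation assistant.
"""
import base64
from openai import OpenAI


def describe_image(image_path: str, query: str) -> str:
    """Stage 1: obtain a chain-of-thought description of the image from GPT-4o."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Think step by step and describe this image in detail, "
                          f"focusing on anything relevant to: {query}")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


def segment_with_assistant(image_path: str, prompt: str):
    """Stage 2 (hypothetical stand-in): a LISA-style segmentation assistant
    that returns a mask for the prompted object. This stub only marks where
    a real model call would go."""
    raise NotImplementedError("Plug in a language-instructed segmentation model here.")


def think_first_segment(image_path: str, query: str):
    """Full pipeline: chain-of-thought description first, then segmentation."""
    description = describe_image(image_path, query)
    # The CoT description is prepended to the original query so the
    # segmentation assistant receives richer context about the target object.
    augmented_prompt = f"{description}\n\nSegment: {query}"
    return segment_with_assistant(image_path, augmented_prompt)
```

Consistent with the abstract's claim of being training-free, the sketch fine-tunes nothing: the MLLM's description is simply concatenated with the user's query before segmentation.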
Related papers
- CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models [2.331828779757202]
We present CALICO, the first Large Vision-Language Model (LVLM) designed for multi-image part-level reasoning segmentation.
CALICO features two key components: a novel Correspondence Extraction Module that identifies semantic part-level correspondences, and Correspondence Adaptation Modules that embed this information into the LVLM.
We show that CALICO, with just 0.3% of its parameters finetuned, achieves strong performance on this challenging task.
arXiv Detail & Related papers (2024-12-26T18:59:37Z) - Visual Prompt Selection for In-Context Learning Segmentation [77.15684360470152]
In this paper, we focus on rethinking and improving the example selection strategy.
We first demonstrate that ICL-based segmentation models are sensitive to different contexts.
Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation.
arXiv Detail & Related papers (2024-07-14T15:02:54Z) - CoReS: Orchestrating the Dance of Reasoning and Segmentation [17.767049542947497]
We believe that the act of reasoning segmentation should mirror the cognitive stages of human visual search.
We introduce the Chains of Reasoning and Segmenting (CoReS) and find this top-down visual hierarchy indeed enhances the visual search process.
Experiments demonstrate the superior performance of our CoReS, which surpasses the state-of-the-art method by 6.5% on the ReasonSeg dataset.
arXiv Detail & Related papers (2024-04-08T16:55:39Z) - DeiSAM: Segment Anything with Deictic Prompting [26.38776252198988]
DeiSAM is a combination of large pre-trained neural networks with differentiable logic reasoners. It segments objects by matching them to logically inferred image regions. Our empirical results demonstrate that DeiSAM is a substantial improvement over purely data-driven baselines.
arXiv Detail & Related papers (2024-02-21T20:43:49Z) - SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation [87.18373801829314]
In-context segmentation aims at segmenting novel images using a few labeled example images, termed "in-context examples".
We propose SEGIC, an end-to-end segment-in-context framework built upon a single vision foundation model (VFM).
SEGIC is a straightforward yet effective approach that yields state-of-the-art performance on one-shot segmentation benchmarks.
arXiv Detail & Related papers (2023-11-24T18:59:42Z) - LISA: Reasoning Segmentation via Large Language Model [68.24075852136761]
We propose a new segmentation task -- reasoning segmentation.
The task is designed to output a segmentation mask given a complex and implicit query text.
We present LISA: Large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multimodal Large Language Models.
arXiv Detail & Related papers (2023-08-01T17:50:17Z) - Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segmentation segments an image region from a language expression.
We develop an algorithm that shifts from being localization-centric to segmentation-centric.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z) - Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)