CoReS: Orchestrating the Dance of Reasoning and Segmentation
- URL: http://arxiv.org/abs/2404.05673v3
- Date: Thu, 11 Jul 2024 03:25:06 GMT
- Title: CoReS: Orchestrating the Dance of Reasoning and Segmentation
- Authors: Xiaoyi Bao, Siyang Sun, Shuailei Ma, Kecheng Zheng, Yuxin Guo, Guosheng Zhao, Yun Zheng, Xingang Wang,
- Abstract summary: We believe that the act of reasoning segmentation should mirror the cognitive stages of human visual search.
We introduce the Chains of Reasoning and Segmenting (CoReS) and find this top-down visual hierarchy indeed enhances the visual search process.
Experiments demonstrate the superior performance of our CoReS, which surpasses the state-of-the-art method by 6.5% on the ReasonSeg dataset.
- Score: 17.767049542947497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The reasoning segmentation task, which demands a nuanced comprehension of intricate queries to accurately pinpoint object regions, is attracting increasing attention. However, Multi-modal Large Language Models (MLLM) often find it difficult to accurately localize the objects described in complex reasoning contexts. We believe that the act of reasoning segmentation should mirror the cognitive stages of human visual search, where each step is a progressive refinement of thought toward the final object. Thus we introduce the Chains of Reasoning and Segmenting (CoReS) and find this top-down visual hierarchy indeed enhances the visual search process. Specifically, we propose a dual-chain structure that generates multi-modal, chain-like outputs to aid the segmentation process. Furthermore, to steer the MLLM's outputs into this intended hierarchy, we incorporate in-context inputs as guidance. Extensive experiments demonstrate the superior performance of our CoReS, which surpasses the state-of-the-art method by 6.5\% on the ReasonSeg dataset. Project: https://chain-of-reasoning-and-segmentation.github.io/.
Related papers
- Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies [69.28082193942991]
This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills.
utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches.
To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR)
arXiv Detail & Related papers (2024-06-16T12:58:31Z) - Frequency-based Matcher for Long-tailed Semantic Segmentation [22.199174076366003]
We focus on a relatively under-explored task setting, long-tailed semantic segmentation (LTSS)
We propose a dual-metric evaluation system and construct the LTSS benchmark to demonstrate the performance of semantic segmentation methods and long-tailed solutions.
We also propose a transformer-based algorithm to improve LTSS, frequency-based matcher, which solves the oversuppression problem by one-to-many matching.
arXiv Detail & Related papers (2024-06-06T09:57:56Z) - Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z) - LaSagnA: Language-based Segmentation Assistant for Complex Queries [39.620806493454616]
Large Language Models for Vision (vLLMs) generate detailed perceptual outcomes, including bounding boxes and masks.
In this study, we acknowledge that the main cause of these problems is the insufficient complexity of training queries.
We present three novel strategies to effectively handle the challenges arising from the direct integration of the proposed format.
arXiv Detail & Related papers (2024-04-12T14:40:45Z) - DeiSAM: Segment Anything with Deictic Prompting [27.960890657540443]
DeiSAM is a combination of large pre-trained neural networks with differentiable logic reasoners.
It segments objects by matching them to logically inferred image regions.
Our empirical results demonstrate that DeiSAM is a substantial improvement over purely data-driven baselines.
arXiv Detail & Related papers (2024-02-21T20:43:49Z) - Coarse-to-Fine Amodal Segmentation with Shape Prior [52.38348188589834]
Amodal object segmentation is a challenging task that involves segmenting both visible and occluded parts of an object.
We propose a novel approach called Coarse-to-Fine: C2F-Seg, that addresses this problem by progressively modeling the amodal segmentation.
arXiv Detail & Related papers (2023-08-31T15:56:29Z) - LISA: Reasoning Segmentation via Large Language Model [68.24075852136761]
We propose a new segmentation task -- reasoning segmentation.
The task is designed to output a segmentation mask given a complex and implicit query text.
We present LISA: large Language Instructed Assistant, which inherits the language generation capabilities of multimodal Large Language Models.
arXiv Detail & Related papers (2023-08-01T17:50:17Z) - Semantics-Aware Dynamic Localization and Refinement for Referring Image
Segmentation [102.25240608024063]
Referring image segments an image from a language expression.
We develop an algorithm that shifts from being localization-centric to segmentation-language.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z) - Referring Transformer: A One-step Approach to Multi-task Visual
Grounding [45.42959940733406]
We propose a simple one-stage multi-task framework for visual grounding tasks.
Specifically, we leverage a transformer architecture, where two modalities are fused in a visual-lingual encoder.
We show that our model benefits greatly from contextualized information and multi-task training.
arXiv Detail & Related papers (2021-06-06T10:53:39Z) - Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Then-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
arXiv Detail & Related papers (2021-03-30T12:25:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.