CPSeg: Finer-grained Image Semantic Segmentation via Chain-of-Thought
Language Prompting
- URL: http://arxiv.org/abs/2310.16069v2
- Date: Thu, 26 Oct 2023 12:35:37 GMT
- Title: CPSeg: Finer-grained Image Semantic Segmentation via Chain-of-Thought
Language Prompting
- Authors: Lei Li
- Abstract summary: CPSeg is a framework designed to augment image segmentation performance by integrating a novel "Chain-of-Thought" process.
We propose a new vision-language dataset, FloodPrompt, which includes images, semantic masks, and corresponding text information.
- Score: 8.12405696290333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural scene analysis and remote sensing imagery offer immense potential for
advancements in large-scale language-guided context-aware data utilization.
This potential is particularly significant for enhancing performance in
downstream tasks such as object detection and segmentation with designed
language prompting. In light of this, we introduce CPSeg (Chain-of-Thought
Language Prompting for Finer-grained Semantic Segmentation), an innovative
framework designed to augment image segmentation performance by integrating a
novel "Chain-of-Thought" process that harnesses textual information associated
with images. This groundbreaking approach has been applied to a flood disaster
scenario. CPSeg encodes prompt texts derived from various sentences to
formulate a coherent chain-of-thought. We propose a new vision-language
dataset, FloodPrompt, which includes images, semantic masks, and corresponding
text information. This not only strengthens the semantic understanding of the
scenario but also aids in the key task of semantic segmentation through an
interplay of pixel and text matching maps. Our qualitative and quantitative
analyses validate the effectiveness of CPSeg.
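The "interplay of pixel and text matching maps" described above can be sketched as cosine-similarity scoring between per-pixel embeddings and prompt embeddings, followed by a per-pixel argmax. The function name, shapes, and scoring scheme below are illustrative assumptions, not CPSeg's actual interface.

```python
import numpy as np

def pixel_text_matching(pixel_feats, text_embs):
    """Sketch of pixel-text matching maps (hypothetical, not CPSeg's API).

    pixel_feats: (H, W, D) per-pixel embeddings from a vision encoder.
    text_embs:   (C, D) prompt embeddings, one per semantic class.
    Returns the (H, W, C) matching maps and a hard segmentation.
    """
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    maps = np.einsum("hwd,cd->hwc", p, t)  # cosine similarity per pixel/class
    return maps, maps.argmax(axis=-1)      # matching maps + label map
```

In this reading, each class's text prompt acts as a classifier weight vector applied densely over the image.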
Related papers
- Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model [61.389233691596004]
We introduce the DiffPNG framework, which capitalizes on the diffusion model's architecture for segmentation by decomposing the process into a sequence of localization, segmentation, and refinement steps.
Our experiments on the PNG dataset demonstrate that DiffPNG achieves strong performance in the zero-shot PNG task setting.
arXiv Detail & Related papers (2024-07-07T13:06:34Z) - Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation [28.24883865053459]
This paper aims to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations.
Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts.
A text often consists of multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments.
arXiv Detail & Related papers (2024-04-05T17:25:17Z) - Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features.
Our approach achieves favorable performance against existing methods in literature.
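The text-guided contrastive mechanism mentioned above can be sketched as an InfoNCE-style loss that pulls image features toward the text-encoder embedding of their class. All names, shapes, and the temperature value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def text_guided_contrastive_loss(img_feats, text_feats, labels, tau=0.07):
    """Sketch of a contrastive loss guided by text-encoder features.

    img_feats:  (N, D) image features.
    text_feats: (C, D) class embeddings from a frozen text encoder.
    labels:     (N,) integer class indices (the positive text per image).
    """
    i = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = i @ t.T / tau                       # (N, C) scaled similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()
```

Aligned image/text pairs drive the loss toward zero; misaligned pairings are penalized.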
arXiv Detail & Related papers (2024-04-01T17:48:15Z) - Language-guided Few-shot Semantic Segmentation [23.46604057006498]
We propose an innovative solution to tackle the challenge of few-shot semantic segmentation using only language information.
Our approach involves a vision-language-driven mask distillation scheme, which generates high-quality pseudo-semantic masks from text prompts.
Experiments on two benchmark datasets demonstrate that our method establishes a new baseline for language-guided few-shot semantic segmentation.
arXiv Detail & Related papers (2023-11-23T09:08:49Z) - Rewrite Caption Semantics: Bridging Semantic Gaps for
Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves superb zero-shot transfer performance and boosts the language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z) - ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View
Semantic Consistency [126.88107868670767]
We propose Multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z) - BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid
Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task by a novel Bottom-up Cross-modal Semantic Composition (BOSS) with Hybrid Counterfactual Training framework.
arXiv Detail & Related papers (2022-07-09T07:14:44Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
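CRIS's text-to-pixel alignment can be sketched as a per-pixel binary objective over similarities between pixel embeddings and a single referring-expression embedding. The function, shapes, and temperature below are hypothetical, not the CRIS code.

```python
import numpy as np

def text_to_pixel_contrastive_loss(pixel_feats, sent_emb, mask, tau=0.1):
    """Sketch of text-to-pixel alignment (hypothetical, not CRIS's code).

    pixel_feats: (H, W, D) pixel embeddings from the vision-language decoder.
    sent_emb:    (D,) embedding of the referring expression.
    mask:        (H, W) binary ground truth for the referred region.
    """
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    s = sent_emb / np.linalg.norm(sent_emb)
    sim = (p @ s) / tau                    # (H, W) text-to-pixel logits
    prob = 1.0 / (1.0 + np.exp(-sim))      # sigmoid over similarities
    eps = 1e-8                             # avoid log(0)
    bce = -(mask * np.log(prob + eps) + (1 - mask) * np.log(1 - prob + eps))
    return bce.mean()
```

Pixels inside the referred region are pushed toward the sentence embedding, and all others away from it.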
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Fine-grained Image Classification and Retrieval by Combining Visual and
Locally Pooled Textual Features [8.317191999275536]
In particular, the mere presence of text provides strong guiding content that should be employed to tackle a diversity of computer vision tasks.
In this paper, we address the problem of fine-grained classification and image retrieval by leveraging textual information along with visual cues to comprehend the existing intrinsic relation between the two modalities.
arXiv Detail & Related papers (2020-01-14T12:06:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.