CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation
- URL: http://arxiv.org/abs/2303.11797v2
- Date: Sun, 31 Mar 2024 11:53:55 GMT
- Title: CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation
- Authors: Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim,
- Abstract summary: We introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP.
Our method potently adapts CLIP for segmenting seen and unseen classes by fine-tuning its encoders.
- Score: 56.58365347854647
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-vocabulary semantic segmentation presents the challenge of labeling each pixel within an image based on a wide range of text descriptions. In this work, we introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP, for the intricate task of semantic segmentation. Through aggregating the cosine similarity score, i.e., the cost volume between image and text embeddings, our method potently adapts CLIP for segmenting seen and unseen classes by fine-tuning its encoders, addressing the challenges faced by existing methods in handling unseen classes. Building upon this, we explore methods to effectively aggregate the cost volume considering its multi-modal nature of being established between image and text embeddings. Furthermore, we examine various methods for efficiently fine-tuning CLIP.
Related papers
- A Simple Image Segmentation Framework via In-Context Examples [59.319920526160466]
We present SINE, a simple image framework utilizing in-context examples.
We introduce an In-context Interaction module to complement in-context information and produce correlations between the target image and the in-context example.
Experiments on various segmentation tasks show the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-10-07T08:59:05Z) - Visual Prompt Selection for In-Context Learning Segmentation [77.15684360470152]
In this paper, we focus on rethinking and improving the example selection strategy.
We first demonstrate that ICL-based segmentation models are sensitive to different contexts.
Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation.
arXiv Detail & Related papers (2024-07-14T15:02:54Z) - Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation [28.24883865053459]
This paper aims to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations.
Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts.
A text often consists of multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments.
arXiv Detail & Related papers (2024-04-05T17:25:17Z) - Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation [114.72734384299476]
We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.
We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings.
Our approach significantly boosts the capacity of segmentation models for unseen classes.
arXiv Detail & Related papers (2024-03-13T11:23:55Z) - Multi-Grained Cross-modal Alignment for Learning Open-vocabulary
Semantic Segmentation from Text Supervision [23.931443799102663]
We introduce a Multi-Grained Cross-modal Alignment (MGCA) framework to bridge the granularity gap without any dense annotations.
Specifically, MGCA constructs pseudo multi-granular semantic correspondences upon image-text pairs.
Our method achieves significant advancements over state-of-the-art methods, demonstrating its effectiveness and efficiency.
arXiv Detail & Related papers (2024-03-06T13:43:36Z) - Language-guided Few-shot Semantic Segmentation [23.46604057006498]
We propose an innovative solution to tackle the challenge of few-shot semantic segmentation using only language information.
Our approach involves a vision-language-driven mask distillation scheme, which generates high quality pseudo-semantic masks from text prompts.
Experiments on two benchmark datasets demonstrate that our method establishes a new baseline for language-guided few-shot semantic segmentation.
arXiv Detail & Related papers (2023-11-23T09:08:49Z) - ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View
Semantic Consistency [126.88107868670767]
We propose multi-textbfView textbfConsistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image framework (CRIS)
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.