CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic
Segmentation For-Free
- URL: http://arxiv.org/abs/2309.14289v2
- Date: Tue, 28 Nov 2023 13:28:24 GMT
- Title: CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic
Segmentation For-Free
- Authors: Monika Wysoczańska, Michaël Ramamonjisoa, Tomasz Trzciński, Oriane Siméoni
- Abstract summary: We propose an open-vocabulary semantic segmentation method, dubbed CLIP-DIY.
It exploits CLIP classification abilities on patches of different sizes and aggregates the decisions into a single map.
We obtain state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and perform on par with the best methods on COCO.
- Score: 12.15899043709721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The emergence of CLIP has opened the way for open-world image perception. The
zero-shot classification capabilities of the model are impressive but are
harder to use for dense tasks such as image segmentation. Several methods have
proposed different modifications and learning schemes to produce dense output.
Instead, we propose in this work an open-vocabulary semantic segmentation
method, dubbed CLIP-DIY, which does not require any additional training or
annotations, but instead leverages existing unsupervised object localization
approaches. In particular, CLIP-DIY is a multi-scale approach that directly
exploits CLIP classification abilities on patches of different sizes and
aggregates the decisions into a single map. We further guide the segmentation
using foreground/background scores obtained using unsupervised object
localization methods. With our method, we obtain state-of-the-art zero-shot
semantic segmentation results on PASCAL VOC and perform on par with the best
methods on COCO. The code is available at
http://github.com/wysoczanska/clip-diy
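The abstract describes the inference pipeline as: classify patches of several sizes with CLIP, aggregate the per-patch decisions into a single per-pixel map, and guide the result with foreground/background scores from an unsupervised object localizer. The following is a minimal sketch of that idea, under assumptions of our own (OpenAI's `clip` package, a half-patch sliding stride, simple score averaging, and a placeholder `fg_map` standing in for any saliency/object-localization output); it is not the authors' implementation, which is available at the repository above.
```python
# A minimal sketch of multi-scale dense CLIP inference in the spirit of CLIP-DIY.
# Assumptions (not the authors' code): OpenAI's `clip` package, PIL images,
# illustrative patch sizes / stride, simple score averaging, and a foreground
# map `fg_map` coming from any unsupervised object localization method.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

def patch_class_probs(image: Image.Image, class_names, patch_sizes=(224, 112)):
    """Classify square crops at several scales and average per-pixel class scores."""
    W, H = image.size
    text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(text)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    scores = torch.zeros(len(class_names), H, W)
    counts = torch.zeros(H, W)
    for s in patch_sizes:
        stride = s // 2  # half-patch overlap; an illustrative choice
        for top in range(0, max(H - s, 0) + 1, stride):
            for left in range(0, max(W - s, 0) + 1, stride):
                crop = image.crop((left, top, left + s, top + s))
                x = preprocess(crop).unsqueeze(0).to(device)
                with torch.no_grad():
                    img_feat = model.encode_image(x)
                    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
                    probs = (100.0 * img_feat @ text_feat.T).softmax(dim=-1)[0].float().cpu()
                # Spread the patch-level decision over the pixels it covers.
                scores[:, top:top + s, left:left + s] += probs[:, None, None]
                counts[top:top + s, left:left + s] += 1
    return scores / counts.clamp(min=1)

def segment(image, class_names, fg_map):
    """Aggregate multi-scale scores and guide them with foreground/background scores."""
    probs = patch_class_probs(image, class_names)
    probs = probs * fg_map.unsqueeze(0)  # fg_map: (H, W) in [0, 1], from an object localizer
    return probs.argmax(dim=0)           # per-pixel class index
```
In the paper the foreground/background guidance comes from existing unsupervised object localization methods; here `fg_map` is only a placeholder for such a map, and background handling (e.g. a dedicated background class) is omitted for brevity.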
Related papers
- Laser: Efficient Language-Guided Segmentation in Neural Radiance Fields [49.66011190843893]
We propose a method that leverages CLIP feature distillation, achieving efficient 3D segmentation through language guidance.
To achieve this, we introduce an adapter module and mitigate the noise issue in the dense CLIP feature distillation process.
Our method surpasses current state-of-the-art technologies in both training speed and performance.
arXiv Detail & Related papers (2025-01-31T12:19:14Z)
- Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels [53.8817160001038]
We propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding.
To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm.
PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods.
arXiv Detail & Related papers (2024-09-30T01:13:03Z)
- Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation [90.35249276717038]
We propose WeCLIP, a CLIP-based single-stage pipeline, for weakly supervised semantic segmentation.
Specifically, the frozen CLIP model is applied as the backbone for semantic feature extraction.
A new decoder is designed to interpret extracted semantic features for final prediction.
arXiv Detail & Related papers (2024-06-17T03:49:47Z)
- CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation [31.264574799748903]
We propose an open-vocabulary semantic segmentation method, which does not require any annotations.
We show that such self-supervised feature properties can be learnt directly from CLIP features.
Our method CLIP-DINOiser needs only a single forward pass of CLIP and two light convolutional layers at inference.
arXiv Detail & Related papers (2023-12-19T17:40:27Z)
- Side Adapter Network for Open-Vocabulary Semantic Segmentation [69.18441687386733]
This paper presents a new framework for open-vocabulary semantic segmentation with a pre-trained vision-language model, named Side Adapter Network (SAN).
A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias.
Our approach significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed.
arXiv Detail & Related papers (2023-02-23T18:58:28Z)
- SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation [26.079055078561986]
We propose a CLIP-based model named SegCLIP for open-vocabulary segmentation.
The main idea is to gather patches with learnable centers into semantic regions through training on text-image pairs.
Experimental results show that our model achieves comparable or superior segmentation accuracy.
arXiv Detail & Related papers (2022-11-27T12:38:52Z)
- FreeSOLO: Learning to Segment Objects without Annotations [191.82134817449528]
We present FreeSOLO, a self-supervised instance segmentation framework built on top of the simple instance segmentation method SOLO.
Our method also presents a novel localization-aware pre-training framework, where objects can be discovered from complicated scenes in an unsupervised manner.
arXiv Detail & Related papers (2022-02-24T16:31:44Z)
- DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z)
- Sparse Object-level Supervision for Instance Segmentation with Pixel Embeddings [4.038011160363972]
Most state-of-the-art instance segmentation methods have to be trained on densely annotated images.
We propose a proposal-free segmentation approach based on non-spatial embeddings.
We evaluate the proposed method on challenging 2D and 3D segmentation problems in different microscopy modalities.
arXiv Detail & Related papers (2021-03-26T16:36:56Z)