ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation
- URL: http://arxiv.org/abs/2408.06747v1
- Date: Tue, 13 Aug 2024 09:10:48 GMT
- Title: ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation
- Authors: Jingyun Wang, Guoliang Kang
- Abstract summary: We propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation task.
We use a learnable "Reference" prompt to encode class-preference bias and a projection of the positional embedding in the vision transformer to encode space-preference bias.
To make the bias modeling and rectification process meaningful and effective, we impose a contrastive loss based on masked visual features and the text features of different classes.
- Score: 6.012828781329036
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task, where only images without annotations are available. However, we observe that when adopting CLIP for such a pixel-level understanding task, unexpected bias (including class-preference bias and space-preference bias) occurs. Previous works do not explicitly model this bias, which largely constrains the segmentation performance. In this paper, we propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation task. Specifically, we design a learnable "Reference" prompt to encode class-preference bias and a projection of the positional embedding in the vision transformer to encode space-preference bias. To avoid interference, the two kinds of bias are first encoded independently into the Reference feature and the positional feature. Via a matrix multiplication between the two features, a bias logit map is generated to explicitly represent both kinds of bias. We then rectify the logits of CLIP via a simple element-wise subtraction. To make the rectified results smoother and more contextual, we design a mask decoder which takes the features of CLIP and the rectified logits as input and outputs a rectified segmentation mask with the help of a Gumbel-Softmax operation. To make the bias modeling and rectification process meaningful and effective, a contrastive loss based on masked visual features and the text features of different classes is imposed. To further improve segmentation, we distill the knowledge from the rectified CLIP to an advanced segmentation architecture by minimizing our designed mask-guided, feature-guided and text-guided loss terms. Extensive experiments on various benchmarks demonstrate that ReCLIP++ performs favorably against previous SOTAs. The implementation is available at: https://github.com/dogehhh/ReCLIP.
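The abstract describes a concrete rectification pipeline: a Reference feature (class-preference bias) and a projected positional feature (space-preference bias) are multiplied to form a bias logit map, which is subtracted element-wise from CLIP's patch-level logits before a Gumbel-Softmax-based mask decoding step. The sketch below illustrates that flow in PyTorch; the module names, tensor shapes, and hyper-parameters are illustrative assumptions rather than the authors' implementation (see the linked repository for the official code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiasRectifier(nn.Module):
    """Encode class-preference and space-preference bias and subtract the
    resulting bias logit map from CLIP's patch-level classification logits."""

    def __init__(self, num_classes: int, embed_dim: int = 512):
        super().__init__()
        # Learnable "Reference" embedding, one vector per class (class-preference bias).
        self.reference = nn.Parameter(torch.randn(num_classes, embed_dim) * 0.02)
        # Projection of the frozen ViT positional embedding (space-preference bias).
        self.pos_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, clip_logits: torch.Tensor, pos_embed: torch.Tensor) -> torch.Tensor:
        # clip_logits: (B, P, C) patch-to-class logits from CLIP
        # pos_embed:   (P, D) positional embedding of the vision transformer
        pos_feat = self.pos_proj(pos_embed)            # (P, D) positional feature
        # Matrix multiplication of the two features yields a bias logit map.
        bias_logits = pos_feat @ self.reference.t()    # (P, C)
        # Rectification: a simple element-wise subtraction from CLIP's logits.
        return clip_logits - bias_logits.unsqueeze(0)  # (B, P, C)


def gumbel_mask(rectified_logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Turn rectified logits into a (near-)discrete per-patch mask with
    Gumbel-Softmax, keeping the masking step differentiable."""
    return F.gumbel_softmax(rectified_logits, tau=tau, hard=True, dim=-1)


# Toy usage with made-up shapes (2 images, 196 patches, 21 classes, 512-d features).
B, P, C, D = 2, 196, 21, 512
rectifier = BiasRectifier(num_classes=C, embed_dim=D)
clip_logits = torch.randn(B, P, C)   # stand-in for CLIP's patch-level logits
pos_embed = torch.randn(P, D)        # stand-in for the ViT positional embedding
masks = gumbel_mask(rectifier(clip_logits, pos_embed))
print(masks.shape)                   # torch.Size([2, 196, 21])
```

In the paper, the rectified logits additionally pass through a mask decoder and drive the contrastive and distillation losses; the Gumbel-Softmax call above only stands in for that decoding stage.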
Related papers
- LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D segmentation is a visual-language task that segments all points of a specified object from a 3D point cloud, given a query sentence describing it.
We propose a novel Referring 3D pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z) - SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding [56.079013202051094]
We present SegVG, a novel method that transfers box-level annotations into signals providing additional pixel-level supervision for Visual Grounding.
This approach allows us to iteratively exploit the annotation as signals for both box-level regression and pixel-level segmentation.
arXiv Detail & Related papers (2024-07-03T15:30:45Z) - Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation [90.35249276717038]
We propose WeCLIP, a CLIP-based single-stage pipeline, for weakly supervised semantic segmentation.
Specifically, the frozen CLIP model is applied as the backbone for semantic feature extraction.
A new decoder is designed to interpret extracted semantic features for final prediction.
arXiv Detail & Related papers (2024-06-17T03:49:47Z) - Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation [76.40565872257709]
We develop a unified framework that models cross-frame dense correspondence for locally discriminative feature learning.
It is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos.
Our algorithm sets the state of the art on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS).
arXiv Detail & Related papers (2023-03-17T16:23:36Z) - Side Adapter Network for Open-Vocabulary Semantic Segmentation [69.18441687386733]
This paper presents a new framework for open-vocabulary semantic segmentation with a pre-trained vision-language model, named Side Adapter Network (SAN).
A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias.
Our approach significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed.
arXiv Detail & Related papers (2023-02-23T18:58:28Z) - ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation [35.60888272729273]
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme.
While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost.
We propose a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from the image level to the pixel level.
arXiv Detail & Related papers (2022-12-07T12:05:00Z) - Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features that are visible to the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z) - Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation [25.070027668717422]
Generalized zero-shot semantic segmentation (GZS3) predicts pixel-wise semantic labels for seen and unseen classes.
Most GZS3 methods adopt a generative approach that synthesizes visual features of unseen classes from corresponding semantic ones.
We propose a discriminative approach that addresses these limitations in a unified framework.
arXiv Detail & Related papers (2021-08-14T13:33:58Z)