CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation
- URL: http://arxiv.org/abs/2411.10086v3
- Date: Fri, 01 Aug 2025 08:25:34 GMT
- Title: CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation
- Authors: Dengke Zhang, Fagui Liu, Quan Tang
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) struggles to align image patches with category embeddings because of its incoherent patch correlations. This study reveals that inter-class correlations are the main reason for impairing CLIP's segmentation performance. We propose CorrCLIP, which reconstructs the scope and value of patch correlations.
- Score: 6.356330972370584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel without being constrained by a predefined set of categories. While Contrastive Language-Image Pre-training (CLIP) excels in zero-shot classification, it struggles to align image patches with category embeddings because of its incoherent patch correlations. This study reveals that inter-class correlations are the main reason for impairing CLIP's segmentation performance. Accordingly, we propose CorrCLIP, which reconstructs the scope and value of patch correlations. Specifically, CorrCLIP leverages the Segment Anything Model (SAM) to define the scope of patch interactions, reducing inter-class correlations. To mitigate the problem that SAM-generated masks may contain patches belonging to different classes, CorrCLIP incorporates self-supervised models to compute coherent similarity values, suppressing the weight of inter-class correlations. We further introduce two auxiliary branches to strengthen the spatial details and semantic representation of patch features. Finally, we update segmentation maps with SAM-generated masks to improve spatial consistency. Building on these improvements to patch correlations, feature representations, and segmentation maps, CorrCLIP achieves superior performance across eight benchmarks. Code is available at: https://github.com/zdk258/CorrCLIP.
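The abstract's two key operations can be summarized in a short, illustrative sketch: SAM masks define which patches may interact (the scope), and self-supervised patch features supply the similarity values. This is a minimal sketch under assumed tensor shapes, not the authors' implementation; all names below are placeholders.

```python
import torch
import torch.nn.functional as F

def mask_scoped_correlation(clip_values, ssl_features, mask_ids):
    """Illustrative reconstruction of patch correlations.

    clip_values:  (N, D) CLIP value vectors for N image patches.
    ssl_features: (N, C) patch features from a self-supervised model
                  (e.g. DINO), used to compute similarity values.
    mask_ids:     (N,) index of the SAM mask each patch falls into;
                  patches only attend to patches in the same mask.
    """
    # Scope: patches may only interact within the same SAM mask.
    scope = mask_ids.unsqueeze(0) == mask_ids.unsqueeze(1)        # (N, N) bool

    # Value: coherent similarities from self-supervised features,
    # replacing CLIP's incoherent query-key correlations.
    ssl = F.normalize(ssl_features, dim=-1)
    sim = ssl @ ssl.t()                                           # (N, N)
    sim = sim.masked_fill(~scope, float("-inf"))
    attn = sim.softmax(dim=-1)                                    # row-wise weights

    # Aggregate CLIP values with the reconstructed correlations.
    return attn @ clip_values                                     # (N, D)

# Toy usage with random tensors standing in for real model outputs.
N, D, C = 16, 512, 768
out = mask_scoped_correlation(torch.randn(N, D),
                              torch.randn(N, C),
                              torch.randint(0, 3, (N,)))
print(out.shape)  # torch.Size([16, 512])
```

The point of the sketch is that both the interaction scope and the similarity values come from models other than CLIP itself, which is how the inter-class correlations are suppressed.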
Related papers
- Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection [54.21851618853518]
We present a concise yet effective approach called Patch Generation-to-Selection to enhance CLIP's training efficiency.
Our approach, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and retrieval tasks.
arXiv Detail & Related papers (2025-03-21T12:10:38Z) - Is CLIP ideal? No. Can we fix it? Yes! [30.71718499767702]
Contrastive Language-Image Pre-Training is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. We propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models.
arXiv Detail & Related papers (2025-03-10T23:42:04Z) - ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference [32.852004564832455]
We re-investigate the architecture of CLIP, and identify residual connections as the primary source of noise that degrades segmentation quality.
We propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation.
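As a hedged illustration of what such a decomposition could look like, the sketch below keeps only the attention output of the final CLIP block and drops the residual path identified above as the noise source; the self-self (query-query) attention and single-head projection are assumptions, not details taken from the summary.

```python
import torch

def last_block_without_residual(x, w_q, w_v, scale):
    """Sketch of decomposing the final attention block's output.

    x: (N, D) patch tokens entering the last CLIP block.
    Only the attention output is kept; the residual path that would
    re-inject noisy input features is dropped.
    """
    q, v = x @ w_q, x @ w_v
    attn = ((q @ q.t()) * scale).softmax(dim=-1)  # assumed self-self (q-q) attention
    return attn @ v                               # no `+ x` residual, no FFN

N, D = 16, 512
x = torch.randn(N, D)
w_q = torch.randn(D, D) * D ** -0.5
w_v = torch.randn(D, D) * D ** -0.5
dense_feats = last_block_without_residual(x, w_q, w_v, scale=D ** -0.5)
print(dense_feats.shape)  # torch.Size([16, 512])
```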
arXiv Detail & Related papers (2024-07-17T09:52:20Z) - Semantic Compositions Enhance Vision-Language Contrastive Learning [46.985865191341944]
We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining.
Our method fuses the captions and blends 50% of each image to form a new composite sample.
The benefits of CLIP-C are particularly pronounced in settings with relatively limited pretraining data.
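A minimal sketch of the composition step described above (a 50/50 blend of two images paired with a fused caption); whether the blend is an alpha blend or a spatial composite is not stated in the summary, so the alpha blend and the caption-concatenation rule here are assumptions.

```python
import torch

def compose_samples(images_a, images_b, captions_a, captions_b):
    """Illustrative semantic composition: a 50/50 pixel blend of two
    images paired with a fused caption.

    images_*:   (B, 3, H, W) tensors.
    captions_*: lists of B strings.
    """
    mixed = 0.5 * images_a + 0.5 * images_b
    fused = [f"{ca} and {cb}" for ca, cb in zip(captions_a, captions_b)]  # assumed fusion rule
    return mixed, fused

imgs = torch.rand(2, 3, 224, 224)
mixed, fused = compose_samples(imgs, imgs.flip(0),
                               ["a dog on grass", "a red car"],
                               ["a red car", "a dog on grass"])
print(mixed.shape, fused[0])
```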
arXiv Detail & Related papers (2024-07-01T15:58:20Z) - SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation [36.41778553250247]
Weakly-Supervised Semantic Segmentation (WSSS) aims to train segmentation models using image data with only image-level supervision.
We propose a Semantic Prompt Learning for WSSS (SemPLeS) framework, which learns to effectively prompt the CLIP latent space.
SemPLeS can perform better semantic alignment between object regions and the associated class labels.
arXiv Detail & Related papers (2024-01-22T09:41:05Z) - TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training [29.431698321195814]
Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification.
CLIP shows poor performance on multi-label datasets because the global feature tends to be dominated by the most prominent class.
We propose a local-to-global framework to obtain image tags.
arXiv Detail & Related papers (2023-12-20T08:15:40Z) - MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation [110.09800389100599]
We propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation.
Our approach involves generating fine-grained patch-text pairs data by mixing image patches while preserving the correspondence between patches and text.
With MixReorg as a mask learner, conventional text-supervised semantic segmentation models can achieve highly generalizable pixel-semantic alignment ability.
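The patch-mixing idea can be illustrated with a toy sketch in which patches from two images are interleaved while a per-patch source map preserves the patch-text correspondence; the patch size and random mixing pattern are assumptions for illustration.

```python
import torch

def mix_patches(img_a, img_b, patch=16):
    """Mix patches of two images while recording, per patch, which
    source (and hence which caption) each patch came from.

    img_a, img_b: (3, H, W) tensors with H, W divisible by `patch`.
    Returns the mixed image and a (H//patch, W//patch) source map
    (0 = from img_a, 1 = from img_b).
    """
    c, h, w = img_a.shape
    gh, gw = h // patch, w // patch
    source = (torch.rand(gh, gw) < 0.5).long()                              # assumed random mixing pattern
    mask = source.repeat_interleave(patch, 0).repeat_interleave(patch, 1)   # (H, W)
    mixed = torch.where(mask.bool(), img_b, img_a)
    return mixed, source

a, b = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
mixed, source = mix_patches(a, b)
print(mixed.shape, source.shape)  # torch.Size([3, 224, 224]) torch.Size([14, 14])
```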
arXiv Detail & Related papers (2023-08-09T09:35:16Z) - CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation [73.89509052503222]
This paper presents a simple but performant semi-supervised semantic segmentation approach, called CorrMatch.
We observe that the correlation maps not only enable clustering pixels of the same category easily but also contain good shape information.
We propose to conduct pixel propagation by modeling the pairwise similarities of pixels, spreading the predictions of high-confidence pixels to dig out more reliable pseudo-labels.
Then, we perform region propagation to enhance the pseudo labels with accurate class-agnostic masks extracted from the correlation maps.
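A simplified sketch of correlation-based pixel propagation: pairwise feature similarities spread predictions from high-confidence pixels to correlated ones. The confidence threshold and propagation rule below are assumptions, not CorrMatch's exact formulation.

```python
import torch
import torch.nn.functional as F

def propagate_pseudo_labels(feats, probs, conf_thresh=0.95):
    """Simplified correlation-based label propagation.

    feats: (N, C) per-pixel features (flattened spatial grid).
    probs: (N, K) softmax class probabilities from the segmentation head.
    Pixels whose max probability exceeds `conf_thresh` act as anchors
    and push their predictions to similar pixels via the correlation map.
    """
    f = F.normalize(feats, dim=-1)
    corr = (f @ f.t()).clamp(min=0)                      # (N, N) positive affinities
    conf = probs.max(dim=-1).values
    anchors = (conf > conf_thresh).float().unsqueeze(-1)  # (N, 1)

    # Weighted average of anchor predictions; keep the original
    # prediction where no confident anchor is correlated.
    weights = corr * anchors.t()                          # (N, N)
    denom = weights.sum(dim=-1, keepdim=True)
    propagated = (weights @ probs) / denom.clamp(min=1e-6)
    return torch.where(denom > 0, propagated, probs)

out = propagate_pseudo_labels(torch.randn(64, 32), torch.rand(64, 21).softmax(-1))
print(out.shape)  # torch.Size([64, 21])
```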
arXiv Detail & Related papers (2023-06-07T10:02:29Z) - Advancing Incremental Few-shot Semantic Segmentation via Semantic-guided Relation Alignment and Adaptation [98.51938442785179]
Incremental few-shot semantic segmentation aims to incrementally extend a semantic segmentation model to novel classes.
This task faces a severe semantic-aliasing issue between base and novel classes due to data imbalance.
We propose the Semantic-guided Relation Alignment and Adaptation (SRAA) method that fully considers the guidance of prior semantic information.
arXiv Detail & Related papers (2023-05-18T10:40:52Z) - TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation [53.974228542090046]
Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks.
Existing approaches utilizing CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes.
We propose TagCLIP (Trusty-aware guided CLIP) to address this issue.
arXiv Detail & Related papers (2023-04-15T12:52:23Z) - Triplet Contrastive Learning for Unsupervised Vehicle Re-identification [55.445358749042384]
Part feature learning is a critical technique for fine-grained semantic understanding in vehicle re-identification.
We propose a novel Triplet Contrastive Learning framework (TCL) which leverages cluster features to bridge the part features and global features.
arXiv Detail & Related papers (2023-01-23T15:52:12Z) - FECANet: Boosting Few-Shot Semantic Segmentation with Feature-Enhanced Context-Aware Network [48.912196729711624]
Few-shot semantic segmentation is the task of learning to locate each pixel of a novel class in a query image with only a few annotated support images.
We propose a Feature-Enhanced Context-Aware Network (FECANet) to suppress the matching noise caused by inter-class local similarity.
In addition, we propose a novel correlation reconstruction module that encodes extra correspondence relations between foreground and background and multi-scale context semantic features.
arXiv Detail & Related papers (2023-01-19T16:31:13Z) - Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning [82.70453633641466]
We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility function for CLIP's contrastive loss.
We show that PACL is also applicable to image-level predictions and when used with a CLIP backbone, provides a general improvement in zero-shot classification accuracy.
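One plausible reading of a patch-aligned compatibility function is sketched below: patch-text similarities weight the patch embeddings before a pooled cosine comparison with the text embedding. The softmax weighting is an assumption; the summary does not specify PACL's exact form.

```python
import torch
import torch.nn.functional as F

def patch_aligned_compatibility(patch_emb, text_emb):
    """Illustrative patch-aligned score for one image/text pair.

    patch_emb: (N, D) projected patch embeddings.
    text_emb:  (D,)   projected text embedding.
    Patch-text similarities weight the patches, and the pooled vector
    is compared with the text embedding by cosine similarity.
    """
    p = F.normalize(patch_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    weights = (p @ t).softmax(dim=0)            # (N,) alignment weights, assumed softmax
    pooled = (weights.unsqueeze(-1) * p).sum(0)
    return F.normalize(pooled, dim=-1) @ t      # scalar compatibility score

score = patch_aligned_compatibility(torch.randn(196, 512), torch.randn(512))
print(float(score))
```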
arXiv Detail & Related papers (2022-12-09T17:23:00Z) - A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation, building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z) - Causal Intervention for Weakly-Supervised Semantic Segmentation [122.1846968696862]
We aim to generate better pixel-level pseudo-masks by using only image-level labels.
We propose a structural causal model to analyze the causalities among images, contexts, and class labels.
Based on it, we develop a new method: Context Adjustment (CONTA), to remove the confounding bias in image-level classification.
arXiv Detail & Related papers (2020-09-26T09:26:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.