SSR: Semantic and Spatial Rectification for CLIP-based Weakly Supervised Segmentation
- URL: http://arxiv.org/abs/2512.01701v1
- Date: Mon, 01 Dec 2025 14:06:50 GMT
- Title: SSR: Semantic and Spatial Rectification for CLIP-based Weakly Supervised Segmentation
- Authors: Xiuli Bi, Die Xiao, Junchao Fan, Bin Xiao,
- Abstract summary: This paper proposes a novel Semantic and Spatial Rectification (SSR) method to address the limitations of existing CLIP-based weakly supervised semantic segmentation approaches.<n>Specifically, at the semantic level, the Cross-Modal Prototype Alignment (CMPA) establishes a contrastive learning mechanism to enforce feature space alignment.<n>At the spatial level, the Superpixel-Guided Correction (SGC) leverages superpixel-based spatial priors to precisely filter out interference from non-target regions during affinity propagation.
- Score: 19.962252347191523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, Contrastive Language-Image Pretraining (CLIP) has been widely applied to Weakly Supervised Semantic Segmentation (WSSS) tasks due to its powerful cross-modal semantic understanding capabilities. This paper proposes a novel Semantic and Spatial Rectification (SSR) method to address the limitations of existing CLIP-based weakly supervised semantic segmentation approaches: over-activation in non-target foreground regions and background areas. Specifically, at the semantic level, the Cross-Modal Prototype Alignment (CMPA) establishes a contrastive learning mechanism to enforce feature space alignment across modalities, reducing inter-class overlap while enhancing semantic correlations, to rectify over-activation in non-target foreground regions effectively; at the spatial level, the Superpixel-Guided Correction (SGC) leverages superpixel-based spatial priors to precisely filter out interference from non-target regions during affinity propagation, significantly rectifying background over-activation. Extensive experiments on the PASCAL VOC and MS COCO datasets demonstrate that our method outperforms all single-stage approaches, as well as more complex multi-stage approaches, achieving mIoU scores of 79.5% and 50.6%, respectively.
Related papers
- HarmoCLIP: Harmonizing Global and Regional Representations in Contrastive Vision-Language Models [63.87966115136411]
HarmoCLIP is a novel framework designed to harmonize global and region representations within Contrastive Language-Image Pre-training.<n>To strengthen the representation capability at the local level, our method introduces a novel Region-Language Alignment supervision strategy.
arXiv Detail & Related papers (2025-11-27T16:24:53Z) - Adaptive Spatial Augmentation for Semi-supervised Semantic Segmentation [51.645152962504056]
In semi-supervised semantic segmentation, data augmentation plays a crucial role in the weak-to-strong consistency regularization framework.<n>We show that spatial augmentation can contribute to model training in SSSS, despite generating inconsistent masks between the weak and strong augmentations.<n>We propose an adaptive augmentation strategy that dynamically adjusts the augmentation for each instance based on entropy.
arXiv Detail & Related papers (2025-05-29T13:35:48Z) - ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties.
We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block.
The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
arXiv Detail & Related papers (2024-11-24T14:14:14Z) - DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation.
We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity.
Our approach, in particular, allows for more efficient end-to-end process as a single-stage method.
arXiv Detail & Related papers (2024-09-24T06:51:49Z) - Multi-Grained Cross-modal Alignment for Learning Open-vocabulary
Semantic Segmentation from Text Supervision [23.931443799102663]
We introduce a Multi-Grained Cross-modal Alignment (MGCA) framework to bridge the granularity gap without any dense annotations.
Specifically, MGCA constructs pseudo multi-granular semantic correspondences upon image-text pairs.
Our method achieves significant advancements over state-of-the-art methods, demonstrating its effectiveness and efficiency.
arXiv Detail & Related papers (2024-03-06T13:43:36Z) - Spatial Structure Constraints for Weakly Supervised Semantic
Segmentation [100.0316479167605]
A class activation map (CAM) can only locate the most discriminative part of objects.
We propose spatial structure constraints (SSC) for weakly supervised semantic segmentation to alleviate the unwanted object over-activation of attention expansion.
Our approach achieves 72.7% and 47.0% mIoU on the PASCAL VOC 2012 and COCO datasets, respectively.
arXiv Detail & Related papers (2024-01-20T05:25:25Z) - Progressive Feature Self-reinforcement for Weakly Supervised Semantic
Segmentation [55.69128107473125]
We propose a single-stage approach for Weakly Supervised Semantic (WSSS) with image-level labels.
We adaptively partition the image content into deterministic regions (e.g., confident foreground and background) and uncertain regions (e.g., object boundaries and misclassified categories) for separate processing.
Building upon this, we introduce a complementary self-enhancement method that constrains the semantic consistency between these confident regions and an augmented image with the same class labels.
arXiv Detail & Related papers (2023-12-14T13:21:52Z) - Region-level Contrastive and Consistency Learning for Semi-Supervised
Semantic Segmentation [30.1884540364192]
We propose a novel region-level contrastive and consistency learning framework (RC2L) for semi-supervised semantic segmentation.
Specifically, we first propose a Region Mask Contrastive (RMC) loss and a Region Feature Contrastive (RFC) loss to accomplish region-level contrastive property.
Based on the proposed region-level contrastive and consistency regularization, we develop a region-level contrastive and consistency learning framework (RC2L) for semi-supervised semantic segmentation.
arXiv Detail & Related papers (2022-04-28T07:22:47Z) - Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z) - Margin Preserving Self-paced Contrastive Learning Towards Domain
Adaptation for Medical Image Segmentation [51.93711960601973]
We propose a novel margin preserving self-paced contrastive Learning model for cross-modal medical image segmentation.
With the guidance of progressively refined semantic prototypes, a novel margin preserving contrastive loss is proposed to boost the discriminability of embedded representation space.
Experiments on cross-modal cardiac segmentation tasks demonstrate that MPSCL significantly improves semantic segmentation performance.
arXiv Detail & Related papers (2021-03-15T15:23:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.