CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation
- URL: http://arxiv.org/abs/2505.21904v3
- Date: Sun, 08 Jun 2025 03:09:16 GMT
- Title: CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation
- Authors: Pardis Taghavi, Tian Liu, Renjie Li, Reza Langari, Zhengzhong Tu
- Abstract summary: We introduce CAST, a semi-supervised knowledge distillation (SSKD) framework that compresses pretrained vision foundation models (VFM) into compact experts. CAST unfolds in three stages: (1) domain adaptation of the VFM teacher(s) via self-training with contrastive pixel calibration, (2) distillation into a compact student via a unified multi-objective loss, and (3) fine-tuning on labeled data to remove residual pseudo-label bias. On Cityscapes and ADE20K, our ~11X smaller student surpasses its adapted VFM teacher(s) by +3.4 AP (33.9 vs. 30.5) and +1.5 AP (16.7 vs. 15.2) and outperforms state-of-the-art semi-supervised approaches.
- Score: 7.478518822890964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instance segmentation demands costly per-pixel annotations and large models. We introduce CAST, a semi-supervised knowledge distillation (SSKD) framework that compresses pretrained vision foundation models (VFM) into compact experts using limited labeled and abundant unlabeled data. CAST unfolds in three stages: (1) domain adaptation of the VFM teacher(s) via self-training with contrastive pixel calibration, (2) distillation into a compact student via a unified multi-objective loss that couples standard supervision and pseudo-labels with our instance-aware pixel-wise contrastive term, and (3) fine-tuning on labeled data to remove residual pseudo-label bias. Central to CAST is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to mine informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and fully leverage unlabeled images. On Cityscapes and ADE20K, our ~11X smaller student surpasses its adapted VFM teacher(s) by +3.4 AP (33.9 vs. 30.5) and +1.5 AP (16.7 vs. 15.2) and outperforms state-of-the-art semi-supervised approaches.
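The abstract describes a pixel-wise contrastive term that pulls together embeddings of pixels from the same instance and pushes apart pixels from different instances. The sketch below is a generic supervised InfoNCE-style loss over sampled pixel embeddings, written in NumPy; it is not the paper's exact formulation. The function name, the `temperature` default, and the optional `neg_weights` matrix (a hypothetical stand-in for the mask/class-score fusion used to emphasize informative negatives) are all assumptions made for illustration.

```python
import numpy as np

def pixel_contrastive_loss(embeddings, instance_ids, temperature=0.1, neg_weights=None):
    """Supervised InfoNCE-style loss over sampled pixel embeddings (illustrative).

    embeddings:   (N, D) array, one row per sampled pixel.
    instance_ids: (N,) integer array; pixels sharing an id are positives.
    neg_weights:  optional (N, N) non-negative array that up-weights chosen
                  negatives (hypothetical; not taken from the paper).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    instance_ids = np.asarray(instance_ids)
    n = len(instance_ids)

    # Cosine similarity between all pixel pairs, scaled by the temperature.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    z = embeddings / np.maximum(norms, 1e-12)
    sim = z @ z.T / temperature

    same = instance_ids[:, None] == instance_ids[None, :]
    pos = same & ~np.eye(n, dtype=bool)   # same instance, excluding self-pairs
    neg = ~same                           # different instance
    if neg_weights is None:
        neg_weights = np.ones((n, n))

    # Shift each row by its maximum for numerical stability; the loss is unchanged.
    sim = sim - sim.max(axis=1, keepdims=True)
    exp_sim = np.exp(sim)
    neg_sum = (exp_sim * neg_weights * neg).sum(axis=1, keepdims=True)

    # Log-probability of each pair against itself plus the weighted negatives.
    log_prob = sim - np.log(exp_sim + neg_sum)

    pos_count = pos.sum(axis=1)
    valid = pos_count > 0                 # anchors with at least one positive
    if not valid.any():
        return 0.0
    per_anchor = -(log_prob * pos).sum(axis=1)[valid] / pos_count[valid]
    return float(per_anchor.mean())
```

With this definition, embeddings that cluster by instance yield a loss near zero, while embeddings whose nearest neighbors belong to other instances yield a large loss. Passing larger `neg_weights` entries for selected pairs increases the penalty for those negatives.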
Related papers
- CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework [1.2172320168050466]
We introduce Consensus-oriented Masked Distillation (CoMAD). It unifies knowledge from self-supervised Vision Transformers into a compact student network. On ImageNet-1K, CoMAD's ViT-Tiny achieves 75.4 percent Top-1, an increment of 0.4 percent over the previous state-of-the-art.
arXiv Detail & Related papers (2025-08-06T18:55:14Z)
- Pseudo-Label Quality Decoupling and Correction for Semi-Supervised Instance Segmentation [62.55963720723179]
Semi-Supervised Instance Segmentation (SSIS) involves classifying and grouping image pixels into distinct object instances. This learning paradigm usually faces a significant challenge of unstable performance caused by noisy pseudo-labels of instance categories and pixel masks. We introduce a novel Pseudo-Label Quality Decoupling and Correction (PL-DC) framework for tackling the above challenges.
arXiv Detail & Related papers (2025-05-16T10:07:17Z)
- Stable Mean Teacher for Semi-supervised Video Action Detection [3.5743998666556855]
We focus on semi-supervised learning for video action detection. We present Stable Mean Teacher, a simple end-to-end teacher-based framework that benefits from improved and temporally consistent pseudo labels.
arXiv Detail & Related papers (2024-12-10T00:25:33Z)
- ContraCluster: Learning to Classify without Labels by Contrastive Self-Supervision and Prototype-Based Semi-Supervision [7.819942809508631]
We propose ContraCluster, an unsupervised image classification method that combines clustering with the power of contrastive self-supervised learning.
ContraCluster consists of three stages: (1) contrastive self-supervised pre-training (CPT), (2) contrastive prototype sampling (CPS), and (3) prototype-based semi-supervised fine-tuning (PB-SFT).
We demonstrate empirically that ContraCluster achieves new state-of-the-art results for standard benchmark datasets including CIFAR-10, STL-10, and ImageNet-10.
arXiv Detail & Related papers (2023-04-19T01:51:08Z)
- Mix-Teaching: A Simple, Unified and Effective Semi-Supervised Learning Framework for Monocular 3D Object Detection [22.074959519526605]
Mix-Teaching is an effective semi-supervised learning framework that uses both labeled and unlabeled images during training.
Mix-Teaching consistently improves MonoFlex and GUPNet by significant margins under various labeling ratios on the KITTI dataset.
arXiv Detail & Related papers (2022-07-10T12:07:25Z)
- mc-BEiT: Multi-choice Discretization for Image BERT Pre-training [52.04866462439979]
Image BERT pre-training with masked image modeling (MIM) is a popular practice to cope with self-supervised representation learning.
We introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs MIM proxy tasks towards eased and refined multi-choice training objectives.
arXiv Detail & Related papers (2022-03-29T09:08:18Z)
- Adversarial Dual-Student with Differentiable Spatial Warping for Semi-Supervised Semantic Segmentation [70.2166826794421]
We propose a differentiable geometric warping to conduct unsupervised data augmentation.
We also propose a novel adversarial dual-student framework to improve the Mean-Teacher.
Our solution significantly improves the performance and state-of-the-art results are achieved on both datasets.
arXiv Detail & Related papers (2022-03-05T17:36:17Z)
- Weakly Supervised Semantic Segmentation by Pixel-to-Prototype Contrast [43.40192909920495]
Cross-view feature semantic consistency and intra(inter)-class compactness(dispersion) are explored.
We propose two novel pixel-to-prototype contrast regularization terms that are applied across different views and within each single view of an image.
Our method can be seamlessly incorporated into existing WSSS models without any changes to the base network.
arXiv Detail & Related papers (2021-10-14T01:44:57Z)
- Dense Contrastive Learning for Self-Supervised Visual Pre-Training [102.15325936477362]
We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only 1% slower).
arXiv Detail & Related papers (2020-11-18T08:42:32Z)
- Deep Semi-supervised Knowledge Distillation for Overlapping Cervical Cell Instance Segmentation [54.49894381464853]
We propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation.
We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining.
Experiments show that the proposed method improves the performance significantly compared with the supervised method learned from labeled data only.
arXiv Detail & Related papers (2020-07-21T13:27:09Z)
- Un-Mix: Rethinking Image Mixtures for Unsupervised Visual Representation Learning [108.999497144296]
Recently advanced unsupervised learning approaches use the siamese-like framework to compare two "views" from the same image for learning representations.
This work aims to involve the distance concept on label space in the unsupervised learning and let the model be aware of the soft degree of similarity between positive or negative pairs.
Despite its conceptual simplicity, we show empirically that with the solution -- Unsupervised image mixtures (Un-Mix), we can learn subtler, more robust and generalized representations from the transformed input and corresponding new label space.
arXiv Detail & Related papers (2020-03-11T17:59:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.