Selective Masking based Self-Supervised Learning for Image Semantic Segmentation
- URL: http://arxiv.org/abs/2512.06981v1
- Date: Sun, 07 Dec 2025 20:21:26 GMT
- Title: Selective Masking based Self-Supervised Learning for Image Semantic Segmentation
- Authors: Yuemin Wang, Ian Stavness
- Abstract summary: We show that our proposed selective masking method outperforms the traditional random masking method and supervised ImageNet pretraining on downstream segmentation accuracy.
Our proposed Selective Masking Image Reconstruction method provides an effective and practical solution to improve end-to-end semantic segmentation.
- Score: 3.2190659800523562
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a novel self-supervised learning method for semantic segmentation using selective masking image reconstruction as the pretraining task. Our proposed method replaces the random masking augmentation used in most masked image modelling pretraining methods. The proposed selective masking method selectively masks image patches with the highest reconstruction loss by breaking the image reconstruction pretraining into iterative steps to leverage the trained model's knowledge. We show on two general datasets (Pascal VOC and Cityscapes) and two weed segmentation datasets (Nassar 2020 and Sugarbeets 2016) that our proposed selective masking method outperforms the traditional random masking method and supervised ImageNet pretraining on downstream segmentation accuracy by 2.9% for general datasets and 2.5% for weed segmentation datasets. Furthermore, we found that our selective masking method significantly improves accuracy for the lowest-performing classes. Lastly, we show that using the same pretraining and downstream dataset yields the best result for low-budget self-supervised pretraining. Our proposed Selective Masking Image Reconstruction method provides an effective and practical solution to improve end-to-end semantic segmentation workflows, especially for scenarios that require limited model capacity to meet inference speed and computational resource requirements.
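The selection step described in the abstract lends itself to a short sketch. The PyTorch fragment below is a minimal illustration, not the authors' code: the per-patch reconstruction interface of `model`, the patch size, and the 75% mask ratio are assumptions made for the example.

```python
# Minimal sketch of selective masking: mask the patches the current model
# reconstructs worst, so the next pretraining step focuses on them.
import torch


def patchify(images, patch_size):
    """Flatten non-overlapping patches: (B, C, H, W) -> (B, N, C*p*p)."""
    B, C, H, W = images.shape
    p = patch_size
    x = images.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(B, (H // p) * (W // p), C * p * p)


@torch.no_grad()
def selective_mask(model, images, mask_ratio=0.75, patch_size=16):
    """Return a (B, N) boolean mask; True marks a patch to mask next step.

    Assumes `model(images)` yields per-patch reconstructions of shape
    (B, N, C*p*p) -- a hypothetical interface for this sketch.
    """
    target = patchify(images, patch_size)             # (B, N, D)
    pred = model(images)                              # (B, N, D), assumed API
    loss_per_patch = ((pred - target) ** 2).mean(-1)  # (B, N)

    k = int(mask_ratio * target.shape[1])
    topk = loss_per_patch.topk(k, dim=1).indices      # worst-reconstructed patches
    mask = torch.zeros_like(loss_per_patch, dtype=torch.bool)
    mask.scatter_(1, topk, True)
    return mask
```

In the paper's iterative scheme, a selection step like this alternates with reconstruction training, so the mask tracks whatever the partially trained model currently finds hardest.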
Related papers
- Evolved Hierarchical Masking for Self-Supervised Learning [49.77271430882176]
Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training.
This paper introduces an evolved hierarchical masking method to pursue general visual cues modeling in self-supervised learning.
arXiv Detail & Related papers (2025-04-12T09:40:14Z) - Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection [54.21851618853518]
We present a concise yet effective approach called Patch Generation-to-Selection to enhance CLIP's training efficiency.
Our approach, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and retrieval tasks.
arXiv Detail & Related papers (2025-03-21T12:10:38Z) - Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation [42.020470627552136]
Open-vocabulary segmentation is primarily bottlenecked by mask classification, not mask generation.
We propose a novel Fine-grained Semantic Adaptation (FISA) method to address this limitation.
FISA enhances the extracted visual features with fine-grained semantic awareness by explicitly integrating this crucial semantic information early in the visual encoding process.
arXiv Detail & Related papers (2024-09-24T17:50:28Z) - Salience-Based Adaptive Masking: Revisiting Token Dynamics for Enhanced Pre-training [33.39585710223628]
Saliency-Based Adaptive Masking improves pre-training performance of MIM approaches by prioritizing token salience.
We show that our method significantly improves over the state-of-the-art in mask-based pre-training on the ImageNet-1K dataset.
arXiv Detail & Related papers (2024-04-12T08:38:51Z) - Boosting Unsupervised Segmentation Learning [37.63303434543399]
We present two practical improvement techniques for unsupervised segmentation learning.
First, we leverage image post-processing techniques such as guided filtering to refine the output masks.
Second, we introduce a multi-scale consistency criterion, based on a teacher-student training scheme.
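The guided-filtering step can be sketched in a few lines. The sketch assumes opencv-contrib-python (which provides cv2.ximgproc) and a soft mask in [0, 1]; the radius and regularization values are illustrative choices, not taken from the paper.

```python
# Edge-aware refinement of a soft segmentation mask with a guided filter
# (requires opencv-contrib-python for cv2.ximgproc; parameters are illustrative).
import cv2
import numpy as np


def refine_mask(image_bgr: np.ndarray, soft_mask: np.ndarray) -> np.ndarray:
    """Smooth a (H, W) soft mask along the edges of the guiding image."""
    guide = image_bgr.astype(np.float32) / 255.0
    src = soft_mask.astype(np.float32)                        # values in [0, 1]
    refined = cv2.ximgproc.guidedFilter(guide, src, 8, 1e-3)  # radius=8, eps=1e-3
    return (refined > 0.5).astype(np.uint8)                   # back to a hard mask
```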
arXiv Detail & Related papers (2024-04-04T11:49:56Z) - Data-efficient Event Camera Pre-training via Disentangled Masked Modeling [20.987277885575963]
We present a new data-efficient voxel-based self-supervised learning method for event cameras.
Our method overcomes the limitations of previous methods, which either sacrifice temporal information or directly employ paired image data.
It exhibits excellent generalization performance and demonstrates significant improvements across various tasks with fewer parameters and lower computational costs.
arXiv Detail & Related papers (2024-03-01T10:02:25Z) - Variance-insensitive and Target-preserving Mask Refinement for Interactive Image Segmentation [68.16510297109872]
Point-based interactive image segmentation can ease the burden of mask annotation in applications such as semantic segmentation and image editing.
We introduce a novel method, Variance-Insensitive and Target-Preserving Mask Refinement to enhance segmentation quality with fewer user inputs.
Experiments on GrabCut, Berkeley, SBD, and DAVIS datasets demonstrate our method's state-of-the-art performance in interactive image segmentation.
arXiv Detail & Related papers (2023-12-22T02:31:31Z) - DPPMask: Masked Image Modeling with Determinantal Point Processes [49.65141962357528]
Masked Image Modeling (MIM) has achieved impressive representative performance with the aim of reconstructing randomly masked images.
We show that uniformly random masking widely used in previous works unavoidably loses some key objects and changes original semantic information.
To address this issue, we augment MIM with a new masking strategy namely the DPPMask.
Our method is simple yet effective and requires no extra learnable parameters when implemented within various frameworks.
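To make the idea concrete: a determinantal point process assigns higher probability to diverse subsets, so keeping a DPP-selected set of patches (and masking the rest) is less likely to erase a whole object. The greedy MAP routine and linear kernel below are generic illustrative assumptions, not the DPPMask implementation.

```python
# Generic greedy MAP selection for a DPP with kernel L = F F^T (a sketch,
# not the DPPMask code): repeatedly add the patch that most increases the
# log-determinant, i.e. the diversity, of the kept set.
import torch


def greedy_dpp_keep(patch_feats, num_keep):
    """patch_feats: (N, D) per-patch features; returns indices to keep."""
    N = patch_feats.shape[0]
    L = patch_feats @ patch_feats.T + 1e-6 * torch.eye(N)  # jitter for stability
    selected, remaining = [], list(range(N))
    for _ in range(num_keep):
        best, best_gain = None, float("-inf")
        for j in remaining:
            idx = selected + [j]
            gain = torch.logdet(L[idx][:, idx]).item()
            if gain > best_gain:
                best, best_gain = j, gain
        selected.append(best)
        remaining.remove(best)
    return torch.tensor(selected)
```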
arXiv Detail & Related papers (2023-03-13T13:40:39Z) - Investigating and Simplifying Masking-based Saliency Methods for Model Interpretability [5.387323728379395]
Saliency maps that identify the most informative regions of an image are valuable for model interpretability.
A common approach to creating saliency maps involves generating input masks that mask out portions of an image.
We show that a masking model can be trained with as few as 10 examples per class and still generate saliency maps with only a 0.7-point increase in localization error.
arXiv Detail & Related papers (2020-10-19T18:00:36Z) - Deep Semi-supervised Knowledge Distillation for Overlapping Cervical Cell Instance Segmentation [54.49894381464853]
We propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation.
We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining.
Experiments show that the proposed method improves the performance significantly compared with the supervised method learned from labeled data only.
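For context, the mean-teacher component follows a standard pattern: the teacher's weights are an exponential moving average of the student's, and the teacher supervises the student on unlabeled data. The sketch below shows that generic update only, not this paper's mask-guided variant.

```python
# Generic mean-teacher update: the teacher tracks an exponential moving
# average of the student's parameters (a sketch of the standard scheme).
import torch


@torch.no_grad()
def update_teacher(teacher, student, momentum=0.999):
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)
```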
arXiv Detail & Related papers (2020-07-21T13:27:09Z) - Masking as an Efficient Alternative to Finetuning for Pretrained Language Models [49.64561153284428]
We learn selective binary masks for pretrained weights in lieu of modifying them through finetuning.
In intrinsic evaluations, we show that representations computed by masked language models encode information necessary for solving downstream tasks.
arXiv Detail & Related papers (2020-04-26T15:03:47Z)
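The idea of learning binary masks over frozen pretrained weights can be sketched with a straight-through estimator. The layer below is a generic illustration under that assumption, not the paper's released code; the score initialization and threshold are made up for the example.

```python
# Sketch of a linear layer whose frozen pretrained weights are gated by a
# learned binary mask, trained via a straight-through estimator.
import torch
import torch.nn as nn


class MaskedLinear(nn.Module):
    def __init__(self, pretrained: nn.Linear, threshold: float = 0.0):
        super().__init__()
        # Freeze the pretrained weights; only the mask scores are trained.
        self.weight = nn.Parameter(pretrained.weight.detach(), requires_grad=False)
        self.bias = nn.Parameter(pretrained.bias.detach(), requires_grad=False)
        self.scores = nn.Parameter(0.01 * torch.randn_like(self.weight))
        self.threshold = threshold

    def forward(self, x):
        # Hard binary mask in the forward pass...
        hard = (self.scores > self.threshold).float()
        # ...with gradients flowing to `scores` via the straight-through trick.
        mask = hard + self.scores - self.scores.detach()
        return nn.functional.linear(x, self.weight * mask, self.bias)
```

In this sketch only `scores` receives gradients; at deployment the hard mask alone is kept, so the same pretrained weights can be shared across tasks.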
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.