Masked Momentum Contrastive Learning for Zero-shot Semantic
Understanding
- URL: http://arxiv.org/abs/2308.11448v1
- Date: Tue, 22 Aug 2023 13:55:57 GMT
- Title: Masked Momentum Contrastive Learning for Zero-shot Semantic
Understanding
- Authors: Jiantao Wu and Shentong Mo and Muhammad Awais and Sara Atito and
Zhenhua Feng and Josef Kittler
- Abstract summary: Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data.
This study endeavours to evaluate the effectiveness of pure self-supervised learning (SSL) techniques in computer vision tasks.
- Score: 39.424931953675994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised pretraining (SSP) has emerged as a popular technique in
machine learning, enabling the extraction of meaningful feature representations
without labelled data. In the realm of computer vision, pretrained vision
transformers (ViTs) have played a pivotal role in advancing transfer learning.
Nonetheless, the escalating cost of finetuning these large models has posed a
challenge due to the explosion of model size. This study endeavours to evaluate
the effectiveness of pure self-supervised learning (SSL) techniques in computer
vision tasks, obviating the need for finetuning, with the intention of
emulating human-like capabilities in generalisation and recognition of unseen
objects. To this end, we propose an evaluation protocol for zero-shot
segmentation based on a prompting patch. Given a point on the target object as
a prompt, the algorithm computes a similarity map between the selected patch
and all other patches; a simple threshold is then applied to this map to
segment the target. A second evaluation measures intra-object and inter-object
similarity to gauge the discriminatory ability of SSP ViTs. Insights from
prompt-based zero-shot segmentation and from the discriminatory abilities of
SSP ViTs led to the design of a simple SSP approach, termed MMC. This approach
combines Masked image modelling, which encourages similarity among local
features; Momentum-based self-distillation, which transfers semantics from
global to local features; and global Contrast, which promotes the semantics of
global features, thereby enhancing the discriminative representations of SSP
ViTs. Consequently, our proposed method
significantly reduces the overlap of intra-object and inter-object
similarities, thereby facilitating effective object segmentation within an
image. Our experiments reveal that MMC delivers top-tier results in zero-shot
semantic segmentation across various datasets.
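The prompting-patch protocol described in the abstract reduces to a cosine-similarity map followed by a threshold. Below is a minimal sketch assuming pre-extracted patch embeddings from a frozen SSP ViT; the function name, array shapes, and threshold value are illustrative, not from the paper:

```python
import numpy as np

def prompt_segment(patch_feats, prompt_idx, threshold=0.5):
    """Segment an object from a single prompting patch.

    patch_feats: (N, D) array of patch embeddings from a frozen SSP ViT
    prompt_idx:  index of the patch containing the user-selected point
    threshold:   cosine-similarity cut-off for the binary mask
    """
    # L2-normalise so the dot product equals cosine similarity.
    feats = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    sim = feats @ feats[prompt_idx]   # similarity map over all patches, (N,)
    return sim >= threshold           # binary segmentation mask
```

Patches whose features align with the prompt patch above the cut-off form the predicted mask; the paper's point is that the better the SSP features, the better this simple rule segments unseen objects without any finetuning.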
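MMC's three objectives are combined into a single training loss, with the teacher maintained as a momentum (EMA) copy of the student for self-distillation. A schematic sketch of these two mechanics follows; the loss weights, decay rate, and function names are assumptions for illustration, not values from the paper:

```python
import numpy as np

def ema_update(teacher, student, m=0.996):
    """Momentum teacher update for self-distillation: an exponential
    moving average of the student's parameters (decay m is assumed)."""
    return {k: m * teacher[k] + (1.0 - m) * student[k] for k in teacher}

def mmc_loss(l_mim, l_distill, l_contrast, weights=(1.0, 1.0, 1.0)):
    """Total loss: Masked image modelling + Momentum self-distillation
    + global Contrast, with illustrative equal weights."""
    return (weights[0] * l_mim
            + weights[1] * l_distill
            + weights[2] * l_contrast)
```

The high decay keeps the teacher a slowly-moving average of the student, so its global features provide a stable distillation target while the masked-modelling and contrastive terms shape local and global representations respectively.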
Related papers
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- Weakly-Supervised Concealed Object Segmentation with SAM-based Pseudo Labeling and Multi-scale Feature Grouping [40.07070188661184]
Weakly-Supervised Concealed Object Segmentation (WSCOS) aims to segment objects that blend well into their surrounding environments.
It is hard to distinguish concealed objects from the background due to the intrinsic similarity.
We propose a new WSCOS method to address these two challenges.
arXiv Detail & Related papers (2023-05-18T14:31:34Z)
- De-coupling and De-positioning Dense Self-supervised Learning [65.56679416475943]
Dense Self-Supervised Learning (SSL) methods address the limitations of using image-level feature representations when handling images with multiple objects.
We show that they suffer from coupling and positional bias, which arise from the receptive field increasing with layer depth and zero-padding.
We demonstrate the benefits of our method on COCO and on a new challenging benchmark, OpenImage-MINI, for object classification, semantic segmentation, and object detection.
arXiv Detail & Related papers (2023-03-29T18:07:25Z)
- Self-supervised Pre-training for Semantic Segmentation in an Indoor Scene [8.357801312689622]
We propose RegConsist, a method for self-supervised pre-training of a semantic segmentation model.
We use a variant of contrastive learning to train a DCNN model for predicting semantic segmentation from RGB views in the target environment.
The proposed method outperforms models pre-trained on ImageNet and achieves competitive performance when using models that are trained for exactly the same task but on a different dataset.
arXiv Detail & Related papers (2022-10-04T20:10:14Z)
- UniVIP: A Unified Framework for Self-Supervised Visual Pre-training [50.87603616476038]
We propose a novel self-supervised framework to learn versatile visual representations on either single-centric-object or non-iconic dataset.
Massive experiments show that UniVIP pre-trained on non-iconic COCO achieves state-of-the-art transfer performance.
Our method can also exploit single-centric-object datasets such as ImageNet, and outperforms BYOL by 2.5% in linear probing with the same number of pre-training epochs.
arXiv Detail & Related papers (2022-03-14T10:04:04Z)
- SSA: Semantic Structure Aware Inference for Weakly Pixel-Wise Dense Predictions without Cost [36.27226683586425]
The semantic structure aware inference (SSA) is proposed to explore the semantic structure information hidden in different stages of the CNN-based network to generate high-quality CAM in the model inference.
The proposed method has the advantage of introducing no parameters and requiring no training, so it can be applied to a wide range of weakly-supervised pixel-wise dense prediction tasks.
arXiv Detail & Related papers (2021-11-05T11:07:21Z)
- Improving Few-shot Learning by Spatially-aware Matching and CrossTransformer [116.46533207849619]
We study the impact of scale and location mismatch in the few-shot learning scenario.
We propose a novel Spatially-aware Matching scheme to effectively perform matching across multiple scales and locations.
arXiv Detail & Related papers (2020-01-06T14:10:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.