Open-Vocabulary Instance Segmentation via Robust Cross-Modal
Pseudo-Labeling
- URL: http://arxiv.org/abs/2111.12698v1
- Date: Wed, 24 Nov 2021 18:50:47 GMT
- Title: Open-Vocabulary Instance Segmentation via Robust Cross-Modal
Pseudo-Labeling
- Authors: Dat Huynh, Jason Kuen, Zhe Lin, Jiuxiang Gu, Ehsan Elhamifar
- Abstract summary: Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images.
Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
- Score: 61.03262873980619
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary instance segmentation aims at segmenting novel classes
without mask annotations. It is an important step toward reducing laborious
human supervision. Most existing works first pretrain a model on captioned
images covering many novel classes and then finetune it on limited base classes
with mask annotations. However, the high-level textual information learned from
caption pretraining alone cannot effectively encode the details required for
pixel-wise segmentation. To address this, we propose a cross-modal
pseudo-labeling framework, which generates training pseudo masks by aligning
word semantics in captions with visual features of object masks in images.
Thus, our framework is capable of labeling novel classes in captions via their
word semantics to self-train a student model. To account for noises in pseudo
masks, we design a robust student model that selectively distills mask
knowledge by estimating the mask noise levels, hence mitigating the adverse
impact of noisy pseudo masks. By extensive experiments, we show the
effectiveness of our framework, where we significantly improve mAP score by
4.5% on MS-COCO and 5.1% on the large-scale Open Images & Conceptual Captions
datasets compared to the state-of-the-art.
Related papers
- Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels [53.8817160001038]
We propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding.
To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm.
PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods.
arXiv Detail & Related papers (2024-09-30T01:13:03Z) - Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision [87.15580604023555]
Unpair-Seg is a novel weakly-supervised open-vocabulary segmentation framework.
It learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected.
It achieves 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets.
arXiv Detail & Related papers (2024-02-14T06:01:44Z) - Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual
Mask Annotations [86.47908754383198]
Open-Vocabulary (OV) methods leverage large-scale image-caption pairs and vision-language models to learn novel categories.
Our method generates pseudo-mask annotations by leveraging the localization ability of a pre-trained vision-language model for objects present in image-caption pairs.
Our method trained with just pseudo-masks significantly improves the mAP scores on the MS-COCO dataset and OpenImages dataset.
arXiv Detail & Related papers (2023-03-29T17:58:39Z) - Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z) - Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP [45.81698881151867]
Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training.
Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions.
We propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions.
In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-
arXiv Detail & Related papers (2022-10-09T02:57:32Z) - MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image
Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
arXiv Detail & Related papers (2022-08-25T17:59:58Z) - GANSeg: Learning to Segment by Unsupervised Hierarchical Image
Generation [16.900404701997502]
We propose a GAN-based approach that generates images conditioned on latent masks.
We show that such mask-conditioned image generation can be learned faithfully when conditioning the masks in a hierarchical manner.
It also lets us generate image-mask pairs for training a segmentation network, which outperforms the state-of-the-art unsupervised segmentation methods on established benchmarks.
arXiv Detail & Related papers (2021-12-02T07:57:56Z) - Few-shot Semantic Image Synthesis Using StyleGAN Prior [8.528384027684192]
We present a training strategy that performs pseudo labeling of semantic masks using the StyleGAN prior.
Our key idea is to construct a simple mapping between the StyleGAN feature and each semantic class from a few examples of semantic masks.
Although the pseudo semantic masks might be too coarse for previous approaches that require pixel-aligned masks, our framework can synthesize high-quality images from not only dense semantic masks but also sparse inputs such as landmarks and scribbles.
arXiv Detail & Related papers (2021-03-27T11:04:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.