Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual
Mask Annotations
- URL: http://arxiv.org/abs/2303.16891v1
- Date: Wed, 29 Mar 2023 17:58:39 GMT
- Title: Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual
Mask Annotations
- Authors: Vibashan VS, Ning Yu, Chen Xing, Can Qin, Mingfei Gao, Juan Carlos
Niebles, Vishal M. Patel, Ran Xu
- Abstract summary: Open-Vocabulary (OV) methods leverage large-scale image-caption pairs and vision-language models to learn novel categories.
Our method generates pseudo-mask annotations by leveraging the localization ability of a pre-trained vision-language model for objects present in image-caption pairs.
Our method trained with just pseudo-masks significantly improves the mAP scores on the MS-COCO dataset and OpenImages dataset.
- Score: 86.47908754383198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing instance segmentation models learn task-specific information using
manual mask annotations from base (training) categories. These mask annotations
require tremendous human effort, limiting the scalability to annotate novel
(new) categories. To alleviate this problem, Open-Vocabulary (OV) methods
leverage large-scale image-caption pairs and vision-language models to learn
novel categories. In summary, an OV method learns task-specific information
using strong supervision from base annotations and novel category information
using weak supervision from image-caption pairs. This difference between
strong and weak supervision leads to overfitting on base categories, resulting
in poor generalization towards novel categories. In this work, we overcome this
issue by learning both base and novel categories from pseudo-mask annotations
generated by the vision-language model in a weakly supervised manner using our
proposed Mask-free OVIS pipeline. Our method automatically generates
pseudo-mask annotations by leveraging the localization ability of a pre-trained
vision-language model for objects present in image-caption pairs. The generated
pseudo-mask annotations are then used to supervise an instance segmentation
model, freeing the entire pipeline from any labour-intensive instance-level
annotations and overfitting. Our extensive experiments show that our method
trained with just pseudo-masks significantly improves the mAP scores on the
MS-COCO dataset and OpenImages dataset compared to the recent state-of-the-art
methods trained with manual masks. Code and models are available at
https://vibashan.github.io/ovis-web/.
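For illustration, below is a minimal sketch of how a pseudo-mask could be derived from a vision-language model's localization signal for one noun in a caption. This is not the authors' implementation: the activation map (e.g. a GradCAM-style map from a pre-trained VLM), the region proposals, and the threshold are assumed inputs, and the function name pseudo_mask_from_activation is hypothetical.

```python
# Hedged sketch: turn a VLM localization map for one caption noun into a
# pseudo-box and pseudo-mask. The activation map, proposal boxes, and the
# threshold are assumed inputs; the actual Mask-free OVIS pipeline refines
# these further before training the instance segmentation model.
import torch


def pseudo_mask_from_activation(
    activation: torch.Tensor,   # (H, W) localization map for one caption noun
    proposals: torch.Tensor,    # (N, 4) candidate boxes as (x1, y1, x2, y2)
    mask_thresh: float = 0.5,
) -> tuple[torch.Tensor, torch.Tensor]:
    # Normalize the activation map to [0, 1].
    act = (activation - activation.min()) / (activation.max() - activation.min() + 1e-6)

    # Score each proposal by the mean activation it encloses.
    scores = []
    for x1, y1, x2, y2 in proposals.round().long().tolist():
        region = act[y1:y2, x1:x2]
        scores.append(region.mean() if region.numel() > 0 else act.new_tensor(0.0))
    best = int(torch.stack(scores).argmax())

    # Threshold the activation inside the best box to obtain the pseudo-mask.
    x1, y1, x2, y2 = proposals[best].round().long().tolist()
    mask = torch.zeros_like(act, dtype=torch.bool)
    mask[y1:y2, x1:x2] = act[y1:y2, x1:x2] > mask_thresh
    return proposals[best], mask


if __name__ == "__main__":
    # Random tensors stand in for real VLM activations and proposal boxes.
    act = torch.rand(224, 224)
    boxes = torch.tensor([[10.0, 10.0, 100.0, 100.0], [50.0, 60.0, 200.0, 180.0]])
    box, mask = pseudo_mask_from_activation(act, boxes)
    print(box, mask.sum().item())
```

Box-mask pairs produced this way would then stand in for manual annotations when supervising a standard instance segmentation model, as described in the abstract.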
Related papers
- Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation [42.020470627552136]
Open-vocabulary panoptic segmentation is an emerging task aiming to accurately segment the image into semantically meaningful masks.
Mask classification is identified as the main performance bottleneck for open-vocabulary panoptic segmentation.
We propose Semantic Refocused Tuning, a novel framework that greatly enhances open-vocabulary panoptic segmentation.
arXiv Detail & Related papers (2024-09-24T17:50:28Z)
- MaskInversion: Localized Embeddings via Optimization of Explainability Maps [49.50785637749757]
MaskInversion generates a context-aware embedding for a query image region specified by a mask at test time.
It can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, localized captioning, and image generation.
arXiv Detail & Related papers (2024-07-29T14:21:07Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- What You See is What You Classify: Black Box Attributions [61.998683569022006]
We train a deep network, the Explainer, to predict attributions for a pre-trained black-box classifier, the Explanandum.
Unlike most existing approaches, ours is capable of directly generating very distinct class-specific masks.
We show that our attributions are superior to established methods both visually and quantitatively.
arXiv Detail & Related papers (2022-05-23T12:30:04Z)
- ContrastMask: Contrastive Learning to Segment Every Thing [18.265503138997794]
We propose ContrastMask, which learns a mask segmentation model on both seen and unseen categories.
Features from the mask regions (foreground) are pulled together and contrasted against those from the background, and vice versa.
Exhaustive experiments on the COCO dataset demonstrate the superiority of our method.
arXiv Detail & Related papers (2022-03-18T07:41:48Z)
- Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images.
Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
arXiv Detail & Related papers (2021-11-24T18:50:47Z)
- Scaling up instance annotation via label propagation [69.8001043244044]
We propose a highly efficient annotation scheme for building large datasets with object segmentation masks.
We exploit similarities among objects by applying hierarchical clustering to mask predictions made by a segmentation model.
We show that we obtain 1M object segmentation masks with a total annotation time of only 290 hours.
arXiv Detail & Related papers (2021-10-05T18:29:34Z)
- The surprising impact of mask-head architecture on novel class segmentation [27.076315496682444]
We show that the architecture of the mask-head plays a surprisingly important role in generalization to classes for which we do not observe masks during training.
We also show that the choice of mask-head architecture alone can lead to SOTA results on the partially supervised COCO benchmark without the need of specialty modules or losses proposed by prior literature.
arXiv Detail & Related papers (2021-04-01T16:46:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.