Towards Universal Vision-language Omni-supervised Segmentation
- URL: http://arxiv.org/abs/2303.06547v1
- Date: Sun, 12 Mar 2023 02:57:53 GMT
- Title: Towards Universal Vision-language Omni-supervised Segmentation
- Authors: Bowen Dong, Jiaxi Gu, Jianhua Han, Hang Xu, Wangmeng Zuo
- Abstract summary: We present Vision-Language Omni-Supervised Segmentation (VLOSS), an end-to-end framework for open-world universal segmentation built on Mask2Former with a CLIP text encoder.
We incorporate omni-supervised data (i.e., panoptic segmentation data, object detection data, and image-text pairs) into training, thus enriching the open-world segmentation ability.
With fewer parameters, our VLOSS with a Swin-Tiny backbone surpasses MaskCLIP by ~2% mask AP on the LVIS v1 dataset.
- Score: 72.31277932442988
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Existing open-world universal segmentation approaches usually leverage CLIP
and pre-computed proposal masks to treat open-world segmentation tasks as
proposal classification. However, 1) these works cannot handle universal
segmentation in an end-to-end manner, and 2) the limited scale of panoptic
datasets restricts the open-world segmentation ability on thing classes. In
this paper, we present Vision-Language Omni-Supervised Segmentation (VLOSS).
VLOSS starts from a Mask2Former universal segmentation framework with a CLIP text
encoder. To improve the open-world segmentation ability, we incorporate
omni-supervised data (i.e., panoptic segmentation data, object detection data,
and image-text pairs) into training, which broadens open-world coverage and
achieves better segmentation accuracy. To improve training efficiency and fully
exploit the omni-supervised data, we propose several techniques: an FPN-style
encoder, a switchable training technique, and a positive classification loss.
Benefiting from end-to-end training with the proposed techniques, VLOSS can be
applied to various open-world segmentation tasks without further adaptation.
Experimental results on different open-world panoptic and instance segmentation
benchmarks demonstrate the effectiveness of VLOSS. Notably, with fewer
parameters, our VLOSS with a Swin-Tiny backbone surpasses MaskCLIP by ~2% mask
AP on the LVIS v1 dataset.
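For context, the open-vocabulary mechanism the abstract refers to (scoring mask proposal embeddings against CLIP text embeddings of class names) can be summarized in a few lines. The following is a minimal PyTorch sketch, not the authors' code; the function name, tensor shapes, and temperature value are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation): open-vocabulary proposal
# classification, i.e., scoring per-proposal embeddings against CLIP text
# embeddings of class names. Shapes and the temperature are assumptions.
import torch
import torch.nn.functional as F

def classify_proposals(query_embed: torch.Tensor,   # [N, D] per-proposal (mask query) embeddings
                       text_embed: torch.Tensor,    # [K, D] CLIP text embeddings of class names
                       temperature: float = 0.01) -> torch.Tensor:
    """Return [N, K] class probabilities for N proposals over K open-vocabulary classes."""
    q = F.normalize(query_embed, dim=-1)   # unit-normalize proposal embeddings
    t = F.normalize(text_embed, dim=-1)    # unit-normalize text embeddings
    logits = q @ t.T / temperature         # cosine similarity, scaled by temperature
    return logits.softmax(dim=-1)          # per-proposal distribution over class names

# Usage with random tensors standing in for real model outputs:
probs = classify_proposals(torch.randn(100, 512), torch.randn(80, 512))
print(probs.shape)  # torch.Size([100, 80])
```

Because the class set only enters through the text embeddings, swapping in a different vocabulary at inference time requires no retraining, which is what enables open-world evaluation without further adaptation.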
Related papers
- A Lightweight Clustering Framework for Unsupervised Semantic Segmentation [28.907274978550493]
Unsupervised semantic segmentation aims to categorize each pixel in an image into a corresponding class without the use of annotated data.
We propose a lightweight clustering framework for unsupervised semantic segmentation.
Our framework achieves state-of-the-art results on PASCAL VOC and MS COCO datasets.
arXiv Detail & Related papers (2023-11-30T15:33:42Z)
- Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages the existing pretrained vision-language model (VL) to train semantic segmentation models.
ZeroSeg distills the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z)
- OpenVIS: Open-vocabulary Video Instance Segmentation [24.860711503327323]
Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously detect, segment, and track arbitrary object categories in a video.
We propose InstFormer, a framework that achieves powerful open-vocabulary capabilities through lightweight fine-tuning with limited-category data.
arXiv Detail & Related papers (2023-05-26T11:25:59Z)
- Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
- A Simple Framework for Open-Vocabulary Segmentation and Detection [85.21641508535679]
We present OpenSeeD, a simple open-vocabulary segmentation and detection framework that jointly learns from different segmentation and detection datasets.
We first introduce a pre-trained text encoder to encode all the visual concepts in two tasks and learn a common semantic space for them.
After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection.
arXiv Detail & Related papers (2023-03-14T17:58:34Z)
- Open-world Instance Segmentation: Top-down Learning with Bottom-up Supervision [83.57156368908836]
We propose a novel approach for open-world instance segmentation called bottom-Up and top-Down Open-world segmentation (UDOS).
UDOS first predicts parts of objects using a top-down network trained with weak supervision from bottom-up segmentations.
UDOS enjoys both the speed and efficiency of top-down architectures and the ability to generalize to unseen categories from bottom-up supervision.
arXiv Detail & Related papers (2023-03-09T18:55:03Z)
- Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without requiring any dense annotations.
Our method can directly segment objects of arbitrary categories and outperforms zero-shot segmentation methods that require data labeling on three benchmark datasets.
arXiv Detail & Related papers (2022-07-18T09:20:04Z)