Global Knowledge Calibration for Fast Open-Vocabulary Segmentation
- URL: http://arxiv.org/abs/2303.09181v2
- Date: Sat, 15 Jul 2023 05:10:22 GMT
- Title: Global Knowledge Calibration for Fast Open-Vocabulary Segmentation
- Authors: Kunyang Han, Yong Liu, Jun Hao Liew, Henghui Ding, Yunchao Wei, Jiajun
Liu, Yitong Wang, Yansong Tang, Yujiu Yang, Jiashi Feng, Yao Zhao
- Abstract summary: We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
- Score: 124.74256749281625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in pre-trained vision-language models, such as CLIP, have
enabled the segmentation of arbitrary concepts solely from textual inputs, a
process commonly referred to as open-vocabulary semantic segmentation (OVS).
However, existing OVS techniques confront a fundamental challenge: the trained
classifier tends to overfit on the base classes observed during training,
resulting in suboptimal generalization performance to unseen classes. To
mitigate this issue, recent studies have proposed the use of an additional
frozen pre-trained CLIP for classification. Nonetheless, this approach incurs
heavy computational overheads as the CLIP vision encoder must be repeatedly
forward-passed for each mask, rendering it impractical for real-world
applications. To address this challenge, our objective is to develop a fast OVS
model that performs comparably to, or better than, these approaches without the
extra computational burden of the CLIP image encoder during inference. To this
end, we propose a
core idea of preserving the generalizable representation when fine-tuning on
known classes. Specifically, we introduce a text diversification strategy that
generates a set of synonyms for each training category, which prevents the
learned representation from collapsing onto specific known category names.
Additionally, we employ a text-guided knowledge distillation method to preserve
the generalizable knowledge of CLIP. Extensive experiments demonstrate that our
proposed model achieves robust generalization performance across various
datasets. Furthermore, we perform a preliminary exploration of open-vocabulary
video segmentation and present a benchmark that can facilitate future
open-vocabulary research in the video domain.
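The following is a minimal sketch of the two ideas summarized in the abstract: text diversification via synonym-averaged class embeddings, and a text-guided distillation loss that keeps a trainable segmentation model aligned with frozen CLIP. It assumes a CLIP-style text encoder and per-mask embeddings from the student model; every name here (diversified_class_embeddings, text_guided_distillation_loss, synonyms_per_class) is an illustrative placeholder, not the authors' implementation.
```python
# Hedged sketch of the abstract's two ideas (assumptions, not the paper's code).
import torch
import torch.nn.functional as F

def diversified_class_embeddings(clip_text_encoder, tokenizer, synonyms_per_class):
    """Text diversification: embed several synonyms per training category and
    average them, so the learned representation is not tied to one specific
    base-class name."""
    class_embeds = []
    for synonyms in synonyms_per_class:                 # e.g. ["sofa", "couch", "settee"]
        tokens = tokenizer(synonyms)                    # (num_synonyms, seq_len), placeholder tokenizer
        embeds = F.normalize(clip_text_encoder(tokens), dim=-1)
        class_embeds.append(F.normalize(embeds.mean(dim=0), dim=-1))
    return torch.stack(class_embeds)                    # (num_classes, dim)

def text_guided_distillation_loss(student_mask_embeds, teacher_mask_embeds,
                                  text_embeds, tau=0.07):
    """Text-guided distillation: match the student's similarity distribution
    over the (diversified) text embeddings to that of a frozen CLIP teacher."""
    s = F.normalize(student_mask_embeds, dim=-1) @ text_embeds.t() / tau
    t = F.normalize(teacher_mask_embeds, dim=-1) @ text_embeds.t() / tau
    return F.kl_div(F.log_softmax(s, dim=-1), F.softmax(t, dim=-1),
                    reduction="batchmean")
```
Under this reading, the frozen CLIP image encoder is consulted only as a teacher during training, consistent with the abstract's goal of dropping it at inference.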
Related papers
- FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation [47.0028071183214]
FrozenSeg is designed to integrate spatial knowledge from a localization foundation model (e.g., SAM) and semantic knowledge extracted from a ViL model (e.g., CLIP).
FrozenSeg advances state-of-the-art results across various segmentation benchmarks, trained exclusively on COCO panoptic data, and tested in a zero-shot manner.
arXiv Detail & Related papers (2024-09-05T13:36:50Z)
- Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation [19.20874993309959]
Vision-language foundation models, such as CLIP, have showcased remarkable effectiveness in numerous zero-shot image-level tasks.
We propose a baseline for training-free OVSS, termed Neighbour-Aware CLIP (NACLIP).
Our method enforces localization of patches in the self-attention of CLIP's vision transformer, a property that is crucial for dense prediction tasks yet has been overlooked in the OVSS literature (see the sketch after this list).
arXiv Detail & Related papers (2024-04-12T01:08:04Z)
- Open-Vocabulary Segmentation with Semantic-Assisted Calibration [73.39366775301382]
We study open-vocabulary segmentation (OVS) by calibrating the in-vocabulary and domain-biased embedding space with the contextual prior of CLIP.
We present a Semantic-assisted CAlibration Network (SCAN) to achieve state-of-the-art performance on open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2023-12-07T07:00:09Z)
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression [94.28253749970534]
We propose to learn the rank concepts from the rich semantic CLIP latent space.
OrdinalCLIP consists of learnable context tokens and learnable rank embeddings.
Experimental results show that our paradigm achieves competitive performance in general ordinal regression tasks.
arXiv Detail & Related papers (2022-06-06T03:54:53Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target for zero-shot semantic segmentation, by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
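As referenced in the NACLIP entry above, the sketch below illustrates one way a neighbourhood prior could be injected into the attention logits of CLIP's vision transformer so that patch features stay spatially localized for dense prediction. It is a hedged illustration under that assumption, not the method's released code; neighbourhood_biased_attention, grid_hw, and sigma are hypothetical names.
```python
# Hedged sketch: bias single-head attention over ViT patch tokens toward
# spatially nearby patches with a Gaussian log-prior (illustrative only).
import torch

def neighbourhood_biased_attention(q, k, v, grid_hw, sigma=2.0):
    """q, k, v: (num_patches, dim) patch tokens laid out on an (h, w) grid."""
    h, w = grid_hw
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()   # (N, 2) patch coordinates
    dist2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)     # (N, N) squared distances
    spatial_bias = -dist2 / (2 * sigma ** 2)                             # Gaussian log-prior
    logits = q @ k.t() / q.shape[-1] ** 0.5 + spatial_bias               # scaled dot-product + bias
    return torch.softmax(logits, dim=-1) @ v
```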