Open-Vocabulary Segmentation with Semantic-Assisted Calibration
- URL: http://arxiv.org/abs/2312.04089v1
- Date: Thu, 7 Dec 2023 07:00:09 GMT
- Title: Open-Vocabulary Segmentation with Semantic-Assisted Calibration
- Authors: Yong Liu, Sule Bai, Guanbin Li, Yitong Wang, Yansong Tang
- Abstract summary: We study open-vocabulary segmentation (OVS) by calibrating the in-vocabulary and domain-biased embedding space with the contextual prior of CLIP.
We present a Semantic-assisted CAlibration Network (SCAN) to achieve state-of-the-art performance on open-vocabulary segmentation benchmarks.
- Score: 73.39366775301382
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies open-vocabulary segmentation (OVS) by calibrating the
in-vocabulary and domain-biased embedding space with the generalized contextual
prior of CLIP. As the core of open-vocabulary understanding, the alignment of
visual content with the semantics of unbounded text has become the bottleneck
of this field. To address this challenge, recent works propose to utilize CLIP
as an additional classifier and aggregate model predictions with CLIP
classification results. Despite their remarkable progress, the performance of OVS
methods in relevant scenarios is still unsatisfactory compared with supervised
counterparts. We attribute this to the in-vocabulary embedding space and the
domain-biased CLIP prediction. To this end, we present the Semantic-assisted
CAlibration Network (SCAN). In SCAN, we incorporate the generalized semantic prior
of CLIP into the proposal embedding to avoid collapse onto known categories.
In addition, a contextual shift strategy is applied to mitigate the lack of global
context and unnatural background noise. With the above designs, SCAN achieves
state-of-the-art performance on all popular open-vocabulary segmentation
benchmarks. Furthermore, we also address a shortcoming of the existing evaluation
system, which ignores semantic duplication across categories, and propose a new
metric, Semantic-Guided IoU (SG-IoU).
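A minimal sketch of the calibration idea described in the abstract, in PyTorch-style Python; the function names, the convex-combination fusion rule, and the `alpha` weight are illustrative assumptions, not the paper's exact formulation:

```python
import torch.nn.functional as F

def calibrate_proposal_embeddings(proposal_emb, clip_prior, alpha=0.5):
    """Blend in-vocabulary proposal embeddings with a generalized CLIP prior so
    that open-vocabulary classification does not collapse onto the categories
    seen during training (alpha is an assumed mixing weight)."""
    proposal_emb = F.normalize(proposal_emb, dim=-1)   # (N, D) per-proposal embeddings
    clip_prior = F.normalize(clip_prior, dim=-1)       # (N, D) CLIP features of the same proposals
    fused = alpha * proposal_emb + (1 - alpha) * clip_prior
    return F.normalize(fused, dim=-1)

def open_vocab_logits(calibrated_emb, text_emb, temperature=0.01):
    """Cosine-similarity classification against arbitrary category text embeddings."""
    text_emb = F.normalize(text_emb, dim=-1)            # (C, D)
    return calibrated_emb @ text_emb.t() / temperature  # (N, C) class logits
```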
Related papers
- ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties.
We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block.
The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
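A rough sketch of how such a residual cross-correlation attention could look; the layer selection, averaging, and temperature are assumptions rather than the paper's exact RCS module:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def residual_cross_correlation_attention(intermediate_feats, final_values, tau=1.0):
    """Rebuild the last block's attention from the correlation of intermediate-layer
    token features and use it to re-aggregate the final values (training-free)."""
    maps = []
    for feats in intermediate_feats:                    # each: (B, N, D)
        feats = F.normalize(feats, dim=-1)
        maps.append(feats @ feats.transpose(1, 2))      # (B, N, N) token-token correlation
    attn = (torch.stack(maps).mean(0) / tau).softmax(dim=-1)
    return attn @ final_values                          # (B, N, D) re-aggregated dense features
```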
arXiv Detail & Related papers (2024-11-24T14:14:14Z)
- ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference [32.852004564832455]
We re-investigate the architecture of CLIP, and identify residual connections as the primary source of noise that degrades segmentation quality.
We propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation.
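A hedged sketch of dropping the final residual path; the exact decomposition (e.g., how the MLP branch and the attention variant are handled) is an assumption:

```python
import torch.nn as nn

class LastBlockWithoutResidual(nn.Module):
    """Final transformer block that keeps only the self-attention output and drops
    the residual path identified as the main source of noisy dense features."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                    # x: (B, N, D) token features
        h = self.norm(x)
        out, _ = self.attn(h, h, h)          # self-attention output only
        return out                           # no residual (x + out) and no MLP branch
```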
arXiv Detail & Related papers (2024-07-17T09:52:20Z)
- Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation [38.16802763051431]
We propose CLIPtrase, a training-free semantic segmentation strategy.
It enhances local feature awareness through recalibrated self-correlation among patches.
Experiments show an average improvement of 22.3% over CLIP across 9 segmentation benchmarks.
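A small, training-free sketch of recalibrated patch self-correlation; the sharpening temperature and the way class logits are propagated are assumptions, not the method's exact recipe:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recalibrate_with_self_correlation(patch_feats, patch_logits, tau=0.1):
    """Sharpen patch-to-patch similarity and use it to propagate class logits
    between semantically similar patches (training-free)."""
    feats = F.normalize(patch_feats, dim=-1)            # (N, D) patch features
    corr = (feats @ feats.t() / tau).softmax(dim=-1)    # (N, N) recalibrated self-correlation
    return corr @ patch_logits                          # (N, C) smoothed patch-text logits
```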
arXiv Detail & Related papers (2024-07-11T08:12:16Z)
- EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment [28.983503845298824]
We propose Early Dense Alignment (EDA) to bridge the gap between generalizable local semantics and object-level prediction.
In EDA, object-level supervision is used to learn dense-level rather than object-level alignment, preserving local fine-grained semantics.
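A brief sketch of how dense-level alignment can still be trained from object-level supervision; the mask-weighted pooling and tensor shapes are illustrative assumptions:

```python
import torch.nn.functional as F

def object_logits_from_dense_alignment(pixel_feats, text_emb, proposal_masks, tau=0.07):
    """Align every pixel with the text embeddings first, then pool the dense scores
    inside each proposal mask so object-level labels train dense-level alignment."""
    pixel_feats = F.normalize(pixel_feats, dim=-1)      # (H*W, D)
    text_emb = F.normalize(text_emb, dim=-1)            # (C, D)
    dense_logits = pixel_feats @ text_emb.t() / tau     # (H*W, C) per-pixel alignment
    weights = proposal_masks / proposal_masks.sum(-1, keepdim=True).clamp(min=1e-6)
    return weights @ dense_logits                       # (M, C) object-level logits
```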
arXiv Detail & Related papers (2023-09-03T12:04:14Z)
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
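A very rough sketch of aligning unlabeled images with a broad vocabulary via clustering; the one-step clustering and the nearest-word assignment are simplifying assumptions, not the S3A pipeline:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_labels_from_broad_vocabulary(image_feats, vocab_text_emb, num_clusters=50):
    """Cluster unlabeled image features, match each cluster to the closest entry of
    a large vocabulary, and reuse the match as a pseudo-label for self-training."""
    image_feats = F.normalize(image_feats, dim=-1)        # (N, D)
    vocab_text_emb = F.normalize(vocab_text_emb, dim=-1)  # (V, D)
    # One-step clustering for brevity: random data points act as centroids.
    centroids = image_feats[torch.randperm(image_feats.size(0))[:num_clusters]]
    assign = (image_feats @ centroids.t()).argmax(dim=-1)              # (N,) cluster ids
    cluster_to_word = (centroids @ vocab_text_emb.t()).argmax(dim=-1)  # (K,) vocab ids
    return cluster_to_word[assign]                                     # (N,) pseudo-labels
```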
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- Advancing Incremental Few-shot Semantic Segmentation via Semantic-guided Relation Alignment and Adaptation [98.51938442785179]
Incremental few-shot semantic segmentation aims to incrementally extend a semantic segmentation model to novel classes.
This task faces a severe semantic-aliasing issue between base and novel classes due to data imbalance.
We propose the Semantic-guided Relation Alignment and Adaptation (SRAA) method that fully considers the guidance of prior semantic information.
arXiv Detail & Related papers (2023-05-18T10:40:52Z)
- Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
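A compact sketch of the two ingredients named above, synonym-averaged text embeddings and text-guided distillation; the encoder interface and the KL-based loss are assumptions:

```python
import torch
import torch.nn.functional as F

def synonym_text_embedding(encode_text, synonyms):
    """Average text embeddings over a category's synonyms (text diversification);
    `encode_text` is an assumed callable mapping a string to a (D,) tensor."""
    embs = torch.stack([F.normalize(encode_text(s), dim=-1) for s in synonyms])
    return F.normalize(embs.mean(0), dim=-1)

def text_guided_distillation_loss(student_feats, frozen_clip_feats, text_emb, tau=0.07):
    """KL distillation that keeps the student's image-text similarity distribution
    close to frozen CLIP's, preserving its generalizable knowledge."""
    s = F.normalize(student_feats, dim=-1) @ text_emb.t() / tau
    t = F.normalize(frozen_clip_feats, dim=-1) @ text_emb.t() / tau
    return F.kl_div(s.log_softmax(dim=-1), t.softmax(dim=-1), reduction="batchmean")
```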
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
- Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
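A minimal sketch of feeding surrounding segments as context while supervising only the current one; the encoder interface and the frame-index arithmetic are assumptions about how such fine-tuning could be wired up:

```python
import torch

def context_aware_forward(encoder, prev_wav, curr_wav, next_wav):
    """Encode the current segment together with its surrounding segments, then keep
    only the frames belonging to the current segment for the downstream loss.
    `encoder` is an assumed module mapping (B, T) waveforms to (B, F, D) frames."""
    window = torch.cat([prev_wav, curr_wav, next_wav], dim=-1)   # (B, T_total)
    frames = encoder(window)                                     # (B, F_total, D)
    total = window.size(-1)
    start = round(prev_wav.size(-1) / total * frames.size(1))
    end = round((prev_wav.size(-1) + curr_wav.size(-1)) / total * frames.size(1))
    return frames[:, start:end]          # frames of the current segment only
```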
arXiv Detail & Related papers (2022-12-16T15:46:15Z)
- DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
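A hedged sketch of mining dense pseudo-labels from CLIP patch-text similarity; the thresholding and the availability of text-aligned patch features are assumptions, not the paper's exact extraction procedure:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def clip_dense_pseudo_labels(patch_feats, text_emb, threshold=0.5):
    """Assign each patch the best-matching class prompt and keep only confident
    patches, yielding 'free' dense labels usable as supervision."""
    patch_feats = F.normalize(patch_feats, dim=-1)   # (H, W, D) text-aligned patch features
    text_emb = F.normalize(text_emb, dim=-1)         # (C, D) class prompt embeddings
    sims = patch_feats @ text_emb.t()                # (H, W, C) patch-class similarity
    conf, labels = sims.softmax(dim=-1).max(dim=-1)  # per-patch confidence and class
    labels[conf < threshold] = -1                    # ignore low-confidence patches
    return labels                                    # (H, W) pseudo segmentation map
```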
arXiv Detail & Related papers (2021-12-02T09:23:01Z)