MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic
Segmentation
- URL: http://arxiv.org/abs/2304.06957v1
- Date: Fri, 14 Apr 2023 07:01:47 GMT
- Title: MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic
Segmentation
- Authors: Jie Guo, Qimeng Wang, Yan Gao, Xiaolong Jiang, Xu Tang, Yao Hu,
Baochang Zhang
- Abstract summary: We first demonstrate the necessity of image-pixel CLIP feature adaptation, then provide Multi-View Prompt learning (MVP-SEG).
MVP-SEG is an effective solution to achieve image-pixel adaptation and to solve open-vocabulary semantic segmentation.
Experiments show that the multi-view prompts learned from seen categories have strong generalization to unseen categories.
- Score: 44.28355088831045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: CLIP (Contrastive Language-Image Pretraining) is well-developed for
open-vocabulary zero-shot image-level recognition, while its applications in
pixel-level tasks are less investigated, where most efforts directly adopt CLIP
features without deliberate adaptation. In this work, we first demonstrate
the necessity of image-pixel CLIP feature adaptation, then provide Multi-View
Prompt learning (MVP-SEG) as an effective solution to achieve image-pixel
adaptation and to solve open-vocabulary semantic segmentation. Concretely,
MVP-SEG deliberately learns multiple prompts trained with our Orthogonal
Constraint Loss (OCLoss), by which each prompt is supervised to exploit CLIP
features on different object parts, and the collaborative segmentation masks
generated by all prompts promote better segmentation. Moreover, MVP-SEG
introduces Global Prompt Refining (GPR) to further eliminate class-wise
segmentation noise. Experiments show that the multi-view prompts learned from
seen categories generalize strongly to unseen categories, and MVP-SEG+, which
incorporates a knowledge-transfer stage, significantly outperforms previous
methods on several benchmarks. Moreover, qualitative results confirm that
MVP-SEG does lead to better focus on different local parts.
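Only the abstract is available here, so the following is a minimal PyTorch-style sketch of the idea as described above, not the authors' implementation: K learnable prompt contexts condition a frozen CLIP text encoder, an orthogonality penalty on the resulting per-class text embeddings stands in for the OCLoss (whose exact form is not given in the abstract), and the per-prompt segmentation logits are averaged. All module names, tensor shapes, and the assumed text-encoder interface are illustrative assumptions; Global Prompt Refining (GPR) is not shown.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiViewPrompts(nn.Module):
        """Sketch of multi-view prompt learning for dense CLIP features.

        K learnable prompt contexts are prepended to the token embeddings of
        every class name; an orthogonality penalty encourages the K resulting
        text embeddings of each class to diverge (a stand-in for OCLoss).
        """

        def __init__(self, clip_text_encoder, class_token_embeds,
                     num_prompts=4, ctx_len=8, embed_dim=512):
            super().__init__()
            self.text_encoder = clip_text_encoder      # frozen CLIP text tower (assumed interface)
            self.class_tokens = class_token_embeds     # (C, T, D) class-name token embeddings
            # K shared learnable contexts, one per "view"
            self.ctx = nn.Parameter(torch.randn(num_prompts, ctx_len, embed_dim) * 0.02)

        def class_embeddings(self):
            num_classes = self.class_tokens.shape[0]
            views = []
            for k in range(self.ctx.shape[0]):
                ctx = self.ctx[k].unsqueeze(0).expand(num_classes, -1, -1)
                tokens = torch.cat([ctx, self.class_tokens], dim=1)      # prepend context
                views.append(F.normalize(self.text_encoder(tokens), dim=-1))
            return torch.stack(views)                                    # (K, C, D)

        def forward(self, pixel_feats):
            """pixel_feats: (B, D, H, W) L2-normalized dense CLIP image features."""
            text = self.class_embeddings()                               # (K, C, D)
            logits = torch.einsum('kcd,bdhw->kbchw', text, pixel_feats)  # per-view masks
            return logits.mean(dim=0), text                              # fused masks, per-view texts

    def orthogonal_constraint_loss(text_views):
        """Penalize cosine overlap between the K prompt views of each class."""
        k = text_views.shape[0]
        sim = torch.einsum('kcd,jcd->kjc', text_views, text_views)       # (K, K, C)
        diag = torch.eye(k, device=sim.device).unsqueeze(-1)             # mask for self-similarity
        return (sim * (1.0 - diag)).pow(2).mean()

The design point mirrored here is that the constraint acts across the K prompt views of the same class, pushing each view toward different object evidence, while the fused mask remains a simple average of the per-view logits.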
Related papers
- Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels [53.8817160001038]
We propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding.
To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm.
PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods.
arXiv Detail & Related papers (2024-09-30T01:13:03Z)
- MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Semantic Segmentation [33.67313662538398]
We propose a multi-resolution training framework for open-vocabulary semantic segmentation with a single pretrained CLIP backbone.
MROVSeg uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder (a generic sliding-window sketch is given after this list).
We demonstrate the superiority of MROVSeg on well-established open-vocabulary semantic segmentation benchmarks.
arXiv Detail & Related papers (2024-08-27T04:45:53Z)
- Visual Prompt Selection for In-Context Learning Segmentation [77.15684360470152]
In this paper, we focus on rethinking and improving the example selection strategy.
We first demonstrate that ICL-based segmentation models are sensitive to different contexts.
Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation.
arXiv Detail & Related papers (2024-07-14T15:02:54Z)
- Medical Visual Prompting (MVP): A Unified Framework for Versatile and High-Quality Medical Image Segmentation [15.460598807078751]
We propose a medical visual prompting (MVP) framework that leverages pre-training and prompting concepts from natural language processing (NLP).
The MVP enables the segmentation network to better learn shape prompting information and facilitates mutual learning across different tasks.
This novel framework offers improved performance with fewer parameters and holds significant potential for accurate segmentation of lesion regions in various medical tasks.
arXiv Detail & Related papers (2024-04-01T14:06:48Z)
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [90.74967596080982]
This paper extends Contrastive Language-Image Pre-training (CLIP) with multi-granularity alignment.
We develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities.
With parameter efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks.
arXiv Detail & Related papers (2024-01-12T06:35:09Z)
- CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation [6.181169909576527]
Generalized Zero-shot Semantic Segmentation aims to segment both seen and unseen categories under the supervision of the seen ones only.
Existing methods adopt large-scale Vision-Language Models (VLMs), which obtain outstanding zero-shot performance.
We propose CLIP-ZSS (Zero-shot Semantic Segmentation), a training framework that enables any image encoder designed for closed-set segmentation to be applied to zero-shot and open-vocabulary tasks.
arXiv Detail & Related papers (2023-10-03T09:33:47Z)
- Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on harmonic-mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z)
- Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
- Learning to Compose Diversified Prompts for Image Emotion Classification [5.586293129420233]
Contrastive Language-Image Pre-training (CLIP) represents the latest incarnation of pre-trained vision-language models.
CLIP has recently shown its superior power on a wide range of downstream vision-language tasks like Visual Question Answering.
We propose a general framework that shows how CLIP can be effectively applied to Image Emotion Classification.
arXiv Detail & Related papers (2022-01-26T14:31:55Z)
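The MROVSeg entry above describes slicing a high-resolution input into uniform, encoder-sized windows; below is a small, self-contained sketch of that generic sliding-window strategy, not code from the paper. The window size, stride, overlap-averaging scheme, and the segmenter callable are assumptions for illustration.

    import torch

    def sliding_window_segment(image, segmenter, num_classes, win=224, stride=160):
        """Run a fixed-input-size segmenter over a high-resolution image.

        image:     (B, 3, H, W) tensor with H, W >= win
        segmenter: callable mapping (B, 3, win, win) -> (B, num_classes, win, win)
        Overlapping window predictions are averaged into full-resolution logits.
        """
        batch, _, height, width = image.shape
        logits = torch.zeros(batch, num_classes, height, width, device=image.device)
        counts = torch.zeros(1, 1, height, width, device=image.device)

        # Window origins on a regular grid; force the last window onto the border.
        ys = sorted(set(range(0, max(height - win, 0) + 1, stride)) | {max(height - win, 0)})
        xs = sorted(set(range(0, max(width - win, 0) + 1, stride)) | {max(width - win, 0)})

        for y in ys:
            for x in xs:
                crop = image[:, :, y:y + win, x:x + win]       # uniform, encoder-sized patch
                logits[:, :, y:y + win, x:x + win] += segmenter(crop)
                counts[:, :, y:y + win, x:x + win] += 1.0

        return logits / counts                                 # overlap-averaged logits

With stride smaller than the window size, adjacent windows overlap and their logits are averaged, which avoids visible seams at window borders; the multi-resolution aggregation that MROVSeg itself performs is not modeled here.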