ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts
- URL: http://arxiv.org/abs/2301.12171v2
- Date: Tue, 30 May 2023 13:46:57 GMT
- Title: ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts
- Authors: Kwanyoung Kim, Yujin Oh, Jong Chul Ye
- Abstract summary: We propose a novel Zero-shot segmentation with Optimal Transport (ZegOT) method.
The Multiple Prompt Optimal Transport Solver (MPOT) is designed to learn an optimal mapping between multiple text prompts and visual feature maps of the frozen image encoder hidden layers.
We show that our method achieves state-of-the-art (SOTA) performance over existing Zero-shot Semantic Segmentation (ZS3) approaches.
- Score: 41.14796120215464
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent success of large-scale Contrastive Language-Image Pre-training (CLIP)
has led to great promise in zero-shot semantic segmentation by transferring
image-text aligned knowledge to pixel-level classification. However, existing
methods usually require an additional image encoder or retraining/tuning the
CLIP module. Here, we propose a novel Zero-shot segmentation with Optimal
Transport (ZegOT) method that matches multiple text prompts with frozen image
embeddings through optimal transport. In particular, we introduce a novel
Multiple Prompt Optimal Transport Solver (MPOT), which is designed to learn an
optimal mapping between multiple text prompts and visual feature maps of the
frozen image encoder hidden layers. This unique mapping enables each of the
multiple text prompts to focus effectively on distinct visual semantic
attributes. Through extensive experiments on benchmark datasets, we show that
our method achieves state-of-the-art (SOTA) performance over existing
Zero-shot Semantic Segmentation (ZS3) approaches.
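For illustration, the following is a minimal sketch, not the authors' released code, of the matching idea the abstract describes: entropic optimal transport (Sinkhorn iterations) between a class's multiple text-prompt embeddings and the pixel features of a frozen CLIP layer, with the transport plan used to pool prompt-pixel similarities into a per-class score map. The names (sinkhorn, mpot_score_map), the uniform marginals, the cosine cost, and the pooling scheme are illustrative assumptions, not the paper's exact formulation.
```python
import math
import torch


def sinkhorn(cost: torch.Tensor, eps: float = 0.05, n_iters: int = 50) -> torch.Tensor:
    """Log-domain Sinkhorn: entropic OT plan of shape (K, N) with uniform marginals."""
    K, N = cost.shape
    log_mu = torch.full((K,), -math.log(K))   # uniform mass over prompts
    log_nu = torch.full((N,), -math.log(N))   # uniform mass over pixels
    log_kernel = -cost / eps
    u = torch.zeros(K)
    v = torch.zeros(N)
    for _ in range(n_iters):
        u = log_mu - torch.logsumexp(log_kernel + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(log_kernel + u[:, None], dim=0)
    return torch.exp(log_kernel + u[:, None] + v[None, :])


def mpot_score_map(text_embeds: torch.Tensor, pixel_feats: torch.Tensor) -> torch.Tensor:
    """text_embeds: (C, K, D) = classes x prompts x dim; pixel_feats: (H, W, D).

    Returns a (C, H, W) score map: per-class prompt-pixel similarities pooled
    with weights given by the optimal transport plan.
    """
    C, K, D = text_embeds.shape
    H, W, _ = pixel_feats.shape
    N = H * W
    pixels = torch.nn.functional.normalize(pixel_feats.reshape(-1, D), dim=-1)   # (N, D)
    maps = []
    for c in range(C):
        prompts = torch.nn.functional.normalize(text_embeds[c], dim=-1)          # (K, D)
        sim = prompts @ pixels.T                                                  # cosine similarity (K, N)
        plan = sinkhorn(1.0 - sim)                                                # transport plan (K, N)
        # Per pixel, average the prompt similarities with weights from the plan
        # (each column of the plan sums to 1/N, so scale by N to get weights).
        maps.append((plan * sim).sum(dim=0).reshape(H, W) * N)
    return torch.stack(maps)                                                      # (C, H, W)


# Toy usage: 3 classes, 4 prompts per class, a 16x16 frozen feature map of dim 512.
scores = mpot_score_map(torch.randn(3, 4, 512), torch.randn(16, 16, 512))
print(scores.shape)  # torch.Size([3, 16, 16])
```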
Related papers
- Text4Seg: Reimagining Image Segmentation as Text Generation [32.230379277018194]
We introduce Text4Seg, a novel text-as-mask paradigm that casts image segmentation as a text generation problem.
The key innovation is semantic descriptors, a new textual representation of segmentation masks in which each image patch is mapped to its corresponding text label.
We show that Text4Seg achieves state-of-the-art performance on multiple datasets by fine-tuning different MLLM backbones.
arXiv Detail & Related papers (2024-10-13T14:28:16Z)
- Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment [0.7499722271664144]
Contrastive Language-Image Pre-training (CLIP) is a transformative method in multimedia retrieval.
CLIP typically trains two neural networks concurrently to generate joint embeddings for text and image pairs.
This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios.
arXiv Detail & Related papers (2024-09-03T14:33:01Z)
- OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation [57.84148140637513]
Multi-Prompts Sinkhorn Attention (MPSA) effectively replaces the cross-attention mechanism within the Transformer framework in multimodal settings.
OTSeg achieves state-of-the-art (SOTA) performance with significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks; a sketch of this Sinkhorn-attention idea appears after this list.
arXiv Detail & Related papers (2024-03-21T07:15:37Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on the harmonic mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the previous state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
The StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning performs text-image matching by mapping images and text into a common embedding space.
Instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
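As a companion to the OTSeg entry above, here is a small, self-contained illustration, an assumption rather than the OTSeg implementation, of what replacing softmax cross-attention with a Sinkhorn normalization can look like: multiple text prompts act as queries over pixel tokens, and the attention map is alternately row- and column-normalized so mass is balanced across prompts and pixels. The names sinkhorn_normalize and sinkhorn_cross_attention and the iteration count are hypothetical.
```python
import torch


def sinkhorn_normalize(logits: torch.Tensor, n_iters: int = 5) -> torch.Tensor:
    """Alternate row/column normalization of exp(logits) toward a balanced attention map."""
    attn = torch.exp(logits - logits.max())
    for _ in range(n_iters):
        attn = attn / attn.sum(dim=-1, keepdim=True)   # each prompt (row) distributes unit mass
        attn = attn / attn.sum(dim=-2, keepdim=True)   # each pixel (column) receives unit mass
    return attn / attn.sum(dim=-1, keepdim=True)       # finish with a row normalization


def sinkhorn_cross_attention(prompt_q: torch.Tensor, pixel_kv: torch.Tensor) -> torch.Tensor:
    """prompt_q: (K, D) prompt queries; pixel_kv: (N, D) pixel tokens; returns (K, D)."""
    d = prompt_q.shape[-1]
    logits = prompt_q @ pixel_kv.T / d ** 0.5          # (K, N) attention logits
    attn = sinkhorn_normalize(logits)                  # Sinkhorn normalization instead of softmax
    return attn @ pixel_kv                             # prompt features attended over pixels


# Toy usage: 4 prompts attending over 64 pixel tokens of dimension 256.
out = sinkhorn_cross_attention(torch.randn(4, 256), torch.randn(64, 256))
print(out.shape)  # torch.Size([4, 256])
```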
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.