LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation
- URL: http://arxiv.org/abs/2303.12343v2
- Date: Wed, 23 Aug 2023 19:55:51 GMT
- Title: LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation
- Authors: Koutilya Pnvr, Bharat Singh, Pallabi Ghosh, Behjat Siddiquie, David
Jacobs
- Abstract summary: We present a technique for segmenting real and AI-generated images using latent diffusion models (LDMs) trained on internet-scale datasets.
We show up to 6% improvement over standard baselines for text-to-image segmentation on natural images.
For AI-generated imagery, we show close to 20% improvement compared to state-of-the-art techniques.
- Score: 10.623430999818925
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large-scale pre-training tasks like image classification, captioning, or
self-supervised techniques do not incentivize learning the semantic boundaries
of objects. However, recent generative foundation models built using text-based
latent diffusion techniques may learn semantic boundaries. This is because they
have to synthesize intricate details about all objects in an image based on a
text description. Therefore, we present a technique for segmenting real and
AI-generated images using latent diffusion models (LDMs) trained on
internet-scale datasets. First, we show that the latent space of LDMs (z-space)
is a better input representation compared to other feature representations like
RGB images or CLIP encodings for text-based image segmentation. By training the
segmentation models on the latent z-space, which creates a compressed
representation across several domains like different forms of art, cartoons,
illustrations, and photographs, we are also able to bridge the domain gap
between real and AI-generated images. We show that the internal features of
LDMs contain rich semantic information and present a technique in the form of
LD-ZNet to further boost the performance of text-based segmentation. Overall,
we show up to 6% improvement over standard baselines for text-to-image
segmentation on natural images. For AI-generated imagery, we show close to 20%
improvement compared to state-of-the-art techniques. The project is available
at https://koutilya-pnvr.github.io/LD-ZNet/.
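The abstract's two ingredients, segmenting in the VAE z-space and tapping the LDM's internal features, can be probed with off-the-shelf components. Below is a minimal sketch, assuming the diffusers library and a Stable Diffusion v1.4 checkpoint; the mid_block tap point, the fixed timestep, and the zero text embeddings are illustrative assumptions, not details from the paper.

```python
# Minimal sketch (not the authors' released code) of the two representations
# the abstract describes: (1) the VAE z-space latent used as segmentation
# input, and (2) internal UNet features of the LDM that LD-ZNet additionally
# exploits. Checkpoint, timestep, and tap point are illustrative assumptions.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel

repo = "CompVis/stable-diffusion-v1-4"
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae").eval()
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet").eval()

features = {}

def grab_mid(_module, _inputs, output):
    features["mid"] = output  # internal LDM feature map from the mid block

unet.mid_block.register_forward_hook(grab_mid)

@torch.no_grad()
def ldm_representations(image, text_emb):
    """image: (B, 3, 512, 512) in [-1, 1]; text_emb: (B, 77, 768) CLIP states."""
    # 1) z-space: a compressed 4 x 64 x 64 latent shared across visual domains.
    z = vae.encode(image).latent_dist.mode() * vae.config.scaling_factor
    # 2) One UNet pass at a fixed timestep exposes internal features.
    #    (The paper noises z to timestep t first; omitted here for brevity.)
    t = torch.tensor([500])
    unet(z, t, encoder_hidden_states=text_emb)
    return z, features["mid"]

image = torch.rand(1, 3, 512, 512) * 2 - 1   # dummy image in [-1, 1]
text_emb = torch.zeros(1, 77, 768)           # stand-in for CLIP text features
z, mid_feat = ldm_representations(image, text_emb)
print(z.shape, mid_feat.shape)               # (1, 4, 64, 64), (1, 1280, 8, 8)
```

A segmentation decoder would then consume z (ZNet), or z together with such internal feature maps and text embeddings (LD-ZNet), in place of RGB pixels.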
Related papers
- Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features.
Our approach achieves favorable performance against existing methods in the literature.
arXiv Detail & Related papers (2024-04-01T17:48:15Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model that uses open-vocabulary learning to capture multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised Semantic Segmentation [16.863038973001483]
This work introduces three techniques for diffusion-synthetic semantic segmentation training.
First, reliability-aware robust training, originally used in weakly supervised learning, helps segmentation when the quality of the synthetic masks is insufficient.
Second, large-scale pretraining of whole segmentation models, not only backbones, on synthetic ImageNet-1k-class images with pixel labels benefits downstream segmentation tasks.
Third, we introduce prompt augmentation, a data augmentation applied to the prompt text set, to scale up and diversify the training images with limited text resources.
arXiv Detail & Related papers (2023-09-04T05:34:19Z)
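As a toy illustration of that third technique, the sketch below expands a small prompt set with templates and style modifiers using only the Python standard library; the class names, templates, and modifiers are invented for illustration, not taken from the paper.

```python
# A minimal sketch of prompt augmentation as the summary describes it:
# expanding a small prompt set so a text-to-image model can synthesize more
# diverse training images per class. All strings here are illustrative.
from itertools import product

class_names = ["dog", "bicycle", "traffic light"]
templates = [
    "a photo of a {}",
    "a {} in the wild",
    "a close-up photo of a {}",
]
styles = ["", ", high resolution", ", on a cluttered street"]

# The Cartesian product turns 3 classes x 3 templates x 3 styles into 27 prompts.
prompts = [t.format(c) + s for c, t, s in product(class_names, templates, styles)]
print(len(prompts))   # 27
print(prompts[0])     # a photo of a dog
```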
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- Spatial Latent Representations in Generative Adversarial Networks for Image Generation [0.0]
We define a family of spatial latent spaces for StyleGAN2.
We show that our spaces are effective for image manipulation and encode semantic information well.
arXiv Detail & Related papers (2023-03-25T20:01:11Z)
- STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
It significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvements on text-to-image and image-to-text retrieval, respectively.
arXiv Detail & Related papers (2023-01-30T17:21:30Z)
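As a rough sketch of what a sparse token-space mapping can look like, the snippet below projects dense CLIP-style embeddings to non-negative weights over a vocabulary, assuming PyTorch; the SPLADE-style log1p-ReLU head, the vocabulary size, and the SparseHead module are illustrative assumptions, not the paper's architecture.

```python
# A minimal sketch, not the STAIR implementation: one plausible way to map a
# dense CLIP-style embedding into a sparse, vocabulary-sized token space.
import torch
import torch.nn as nn

class SparseHead(nn.Module):
    """Project a dense embedding to non-negative weights over a token vocab."""
    def __init__(self, dim: int = 512, vocab_size: int = 30522):
        super().__init__()
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, dense: torch.Tensor) -> torch.Tensor:
        # ReLU zeroes out most vocabulary entries; log1p dampens large weights.
        return torch.log1p(torch.relu(self.proj(dense)))

head = SparseHead()
image_emb = torch.randn(2, 512)   # stand-ins for CLIP image/text embeddings
text_emb = torch.randn(2, 512)
img_sparse, txt_sparse = head(image_emb), head(text_emb)
# Similarity is a dot product in the shared sparse token space.
sim = img_sparse @ txt_sparse.T
print(img_sparse.shape, (img_sparse > 0).float().mean().item(), sim.shape)
```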
- Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs [10.484851004093919]
We tackle open-world semantic segmentation, which aims at learning to segment arbitrary visual concepts in images.
Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts.
We propose a novel Text-grounded Contrastive Learning framework that enables a model to directly learn region-text alignment.
arXiv Detail & Related papers (2022-12-01T18:59:03Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS relies on vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our framework significantly outperforms prior state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
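A minimal sketch of one way such a text-to-pixel contrastive objective can be set up, assuming PyTorch; the shapes, temperature, and binary cross-entropy form are illustrative assumptions rather than the CRIS implementation:

```python
# A minimal sketch, not the CRIS implementation: score each pixel embedding
# against a sentence embedding and push pixels inside the referred region
# toward the text while pushing the rest away.
import torch
import torch.nn.functional as F

def text_to_pixel_loss(pixel_feats, text_feat, mask, temperature=0.07):
    """pixel_feats: (B, C, H, W); text_feat: (B, C); mask: (B, H, W) in {0, 1}."""
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_feat = F.normalize(text_feat, dim=1)
    # Per-pixel similarity to the sentence embedding: (B, H, W).
    logits = torch.einsum("bchw,bc->bhw", pixel_feats, text_feat) / temperature
    # Binary objective: referred pixels align with the text, others do not.
    return F.binary_cross_entropy_with_logits(logits, mask.float())

pixel_feats = torch.randn(2, 256, 30, 30, requires_grad=True)
text_feat = torch.randn(2, 256)
mask = (torch.rand(2, 30, 30) > 0.5).float()
loss = text_to_pixel_loss(pixel_feats, text_feat, mask)
loss.backward()
print(loss.item())
```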
- Semantic Segmentation with Generative Models: Semi-Supervised Learning and Strong Out-of-Domain Generalization [112.68171734288237]
We propose a novel framework for discriminative pixel-level tasks using a generative model of both images and labels.
We learn a generative adversarial network that captures the joint image-label distribution and is trained efficiently using a large set of unlabeled images.
We demonstrate strong in-domain performance compared to several baselines, and are the first to showcase extreme out-of-domain generalization.
arXiv Detail & Related papers (2021-04-12T21:41:25Z)