Text-Conditioned Sampling Framework for Text-to-Image Generation with
Masked Generative Models
- URL: http://arxiv.org/abs/2304.01515v1
- Date: Tue, 4 Apr 2023 03:52:49 GMT
- Title: Text-Conditioned Sampling Framework for Text-to-Image Generation with
Masked Generative Models
- Authors: Jaewoong Lee, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Yunji Kim,
Jin-Hwa Kim, Jung-Woo Ha, Sung Ju Hwang
- Abstract summary: We propose a learnable sampling model, Text-Conditioned Token Selection (TCTS), to select optimal tokens via localized supervision with text information.
TCTS improves not only the image quality but also the semantic alignment of the generated images with the given texts.
We validate the efficacy of TCTS combined with Frequency Adaptive Sampling (FAS) on various generative tasks, demonstrating that it significantly outperforms the baselines in image-text alignment and image quality.
- Score: 52.29800567587504
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Token-based masked generative models are gaining popularity for their fast
inference time with parallel decoding. While recent token-based approaches
achieve performance competitive with diffusion-based models, their generation
performance is still suboptimal as they sample multiple tokens simultaneously
without considering the dependence among them. We empirically investigate this
problem and propose a learnable sampling model, Text-Conditioned Token
Selection (TCTS), to select optimal tokens via localized supervision with text
information. TCTS improves not only the image quality but also the semantic
alignment of the generated images with the given texts. To further improve the
image quality, we introduce a cohesive sampling strategy, Frequency Adaptive
Sampling (FAS), to each group of tokens divided according to the self-attention
maps. We validate the efficacy of TCTS combined with FAS on various
generative tasks, demonstrating that it significantly outperforms the baselines
in image-text alignment and image quality. Our text-conditioned sampling
framework further reduces the original inference time by more than 50% without
modifying the original generative model.
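As a rough illustration of the sampling idea described above (not the authors' released code), the sketch below shows how a learned, text-conditioned scorer might replace a plain confidence heuristic when deciding which tokens to commit during parallel decoding of a masked generative model; the generator, scorer, and all parameter names are hypothetical placeholders.

```python
import torch

@torch.no_grad()
def sample_with_token_selection(generator, scorer, text_emb,
                                seq_len=256, steps=8, mask_id=0):
    """Parallel decoding where a learned, text-conditioned scorer picks
    which proposed tokens to commit at each step (hypothetical sketch)."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = generator(tokens, text_emb)        # (1, seq_len, vocab_size)
        proposals = logits.argmax(dim=-1)           # greedy token proposals

        # A plain confidence heuristic would keep the highest-probability
        # positions; here a learned scorer conditioned on the text decides.
        keep_scores = scorer(logits, text_emb)      # (1, seq_len), float scores

        still_masked = tokens.eq(mask_id)
        keep_scores = keep_scores.masked_fill(~still_masked, float("-inf"))

        # Commit a growing fraction of positions at each decoding step.
        target_unmasked = seq_len * (step + 1) // steps
        n_commit = target_unmasked - int((~still_masked).sum())
        if n_commit > 0:
            idx = keep_scores.topk(n_commit, dim=-1).indices
            tokens.scatter_(1, idx, proposals.gather(1, idx))
    return tokens
```

In the paper's full method the scorer is trained with localized supervision from the text, and FAS additionally adapts the revision schedule per self-attention-derived token group; neither of those details is reproduced in this sketch.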
Related papers
- Debiasing Vision-Language Models with Text-Only Training [15.069736314663352]
We propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases.
arXiv Detail & Related papers (2024-10-12T04:34:46Z)
- Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment [0.7499722271664144]
Contrastive Language-Image Pre-training (CLIP) is a transformative method in multimedia retrieval.
CLIP typically trains two neural networks concurrently to generate joint embeddings for text and image pairs.
This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios.
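For context, the following is a minimal sketch of plain CLIP joint-embedding retrieval using the Hugging Face transformers API with an off-the-shelf checkpoint; it illustrates the baseline this paper optimizes rather than its fine-tuning recipe, and the image paths are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Rank candidate images against a text query by cosine similarity
# in CLIP's shared embedding space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["cat.jpg", "dog.jpg"]]   # placeholder paths
query = "a photo of a cat"

with torch.no_grad():
    img_inputs = processor(images=images, return_tensors="pt")
    txt_inputs = processor(text=[query], return_tensors="pt", padding=True)
    img_emb = model.get_image_features(**img_inputs)
    txt_emb = model.get_text_features(**txt_inputs)

# Normalize, then score: higher cosine similarity = better match to the query.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (txt_emb @ img_emb.T).squeeze(0)
print(scores.argsort(descending=True))   # image indices, best match first
```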
arXiv Detail & Related papers (2024-09-03T14:33:01Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- Variational Distribution Learning for Unsupervised Text-to-Image Generation [42.3246826401366]
We propose a text-to-image generation algorithm based on deep neural networks when text captions for images are unavailable during training.
We employ a pretrained CLIP model, which is capable of properly aligning embeddings of images and corresponding texts in a joint space.
We optimize a text-to-image generation model by maximizing the data log-likelihood conditioned on pairs of image-text CLIP embeddings.
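A hedged sketch of the core trick follows: during training the generator is conditioned on CLIP image embeddings (so no captions are needed), and at inference a CLIP text embedding is substituted, relying on the shared embedding space. The `decoder.log_prob` and `decoder.sample` interfaces are hypothetical stand-ins, not the paper's API.

```python
import torch

def training_step(decoder, clip_model, images, optimizer):
    """One caption-free update: condition on the CLIP *image* embedding."""
    with torch.no_grad():
        cond = clip_model.get_image_features(pixel_values=images)
    loss = -decoder.log_prob(images, cond).mean()   # maximize conditional log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def generate_from_text(decoder, clip_model, processor, prompt):
    """At inference, the CLIP *text* embedding stands in for the image one."""
    txt = processor(text=[prompt], return_tensors="pt", padding=True)
    cond = clip_model.get_text_features(**txt)
    return decoder.sample(cond)
```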
arXiv Detail & Related papers (2023-03-28T16:18:56Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
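The routing idea can be sketched as follows, assuming a simple two-expert split by timestep; the expert modules, the scheduler step, and the split point are illustrative placeholders rather than eDiffi's actual configuration.

```python
def denoise_with_experts(high_noise_expert, low_noise_expert, scheduler_step,
                         x_t, text_emb, timesteps, split_t=500):
    """Route each reverse-diffusion step to the expert for its noise range."""
    x = x_t
    for t in reversed(timesteps):                  # e.g. range(1000)
        expert = high_noise_expert if t >= split_t else low_noise_expert
        eps = expert(x, t, text_emb)               # predicted noise at step t
        x = scheduler_step(x, eps, t)              # standard DDPM/DDIM update (assumed)
    return x
```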
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto standard of Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
- Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer [8.069590683507997]
We propose MXQ-VAE, a vector quantization method for multimodal image-text representation.
MXQ-VAE accepts a paired image and text as input, and learns a joint quantized representation space.
We can use autoregressive generative models to model the joint image-text representation, and even perform unconditional image-text pair generation.
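A minimal sketch in the spirit of this design is given below: encode both modalities, concatenate along the sequence axis, and quantize against a single shared codebook. All module names and sizes are assumptions, and the straight-through estimator and reconstruction decoder are omitted.

```python
import torch
import torch.nn as nn

class JointQuantizer(nn.Module):
    """Hypothetical joint image-text quantizer with one shared codebook."""

    def __init__(self, img_encoder, txt_encoder, codebook_size=1024, dim=256):
        super().__init__()
        self.img_encoder = img_encoder            # image -> (B, N_img, dim)
        self.txt_encoder = txt_encoder            # token ids -> (B, N_txt, dim)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, image, text_ids):
        # Concatenate image and text features into one sequence.
        z = torch.cat([self.img_encoder(image), self.txt_encoder(text_ids)], dim=1)
        # Nearest-codebook-entry assignment (straight-through trick omitted).
        dists = torch.cdist(z, self.codebook.weight.expand(z.size(0), -1, -1))
        codes = dists.argmin(dim=-1)              # discrete joint image-text codes
        z_q = self.codebook(codes)                # input for an autoregressive prior
        return z_q, codes
```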
arXiv Detail & Related papers (2022-04-15T16:29:55Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
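A hedged sketch of such text-guided latent optimization is shown below; `generator`, `encoder`, and `clip_score` are placeholders for the GAN, its inversion model, and a CLIP-style text-image similarity, not the paper's released components.

```python
import torch
import torch.nn.functional as F

def edit_with_text(generator, encoder, clip_score, image, prompt,
                   steps=100, lr=0.05, lam=0.1):
    """Start from the inverted latent code of an image and nudge it toward
    a text prompt while staying close to the original code."""
    z0 = encoder(image).detach()              # inverted latent code of the input
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        img = generator(z)
        # Maximize text-image similarity; penalize drifting from the inversion.
        loss = -clip_score(img, prompt) + lam * F.mse_loss(z, z0)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator(z).detach()
```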
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
- XGPT: Cross-modal Generative Pre-Training for Image Captioning [80.26456233277435]
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train text-to-image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)