TokenCompose: Text-to-Image Diffusion with Token-level Supervision
- URL: http://arxiv.org/abs/2312.03626v2
- Date: Sun, 23 Jun 2024 23:50:59 GMT
- Title: TokenCompose: Text-to-Image Diffusion with Token-level Supervision
- Authors: Zirui Wang, Zhizhou Sha, Zheng Ding, Yilin Wang, Zhuowen Tu
- Abstract summary: TokenCompose is a Latent Diffusion Model for text-to-image generation.
It achieves enhanced consistency between user-specified text prompts and model-generated images.
- Score: 43.307556249485366
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present TokenCompose, a Latent Diffusion Model for text-to-image generation that achieves enhanced consistency between user-specified text prompts and model-generated images. Despite its tremendous success, the standard denoising process in the Latent Diffusion Model takes text prompts as conditions only, absent explicit constraint for the consistency between the text prompts and the image contents, leading to unsatisfactory results for composing multiple object categories. TokenCompose aims to improve multi-category instance composition by introducing the token-wise consistency terms between the image content and object segmentation maps in the finetuning stage. TokenCompose can be applied directly to the existing training pipeline of text-conditioned diffusion models without extra human labeling information. By finetuning Stable Diffusion, the model exhibits significant improvements in multi-category instance composition and enhanced photorealism for its generated images. Project link: https://mlpc-ucsd.github.io/TokenCompose
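As a rough illustration of what such a token-wise consistency term could look like during finetuning, the sketch below grounds each noun token's cross-attention map to its segmentation mask. The function name, normalization choices, and loss weighting are assumptions for illustration, not the released TokenCompose code.

```python
import torch
import torch.nn.functional as F

def token_grounding_loss(attn_maps, seg_masks):
    """Minimal sketch of a token-level consistency term (illustrative only).

    attn_maps: (N, H, W) cross-attention maps, one per grounded noun token.
    seg_masks: (N, H, W) float binary masks for the corresponding objects.
    """
    losses = []
    for attn, mask in zip(attn_maps, seg_masks):
        probs = attn / (attn.sum() + 1e-8)              # attention as a spatial distribution
        token_term = 1.0 - (probs * mask).sum()         # attention mass outside the object
        attn01 = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
        pixel_term = F.binary_cross_entropy(attn01.clamp(1e-6, 1 - 1e-6), mask)
        losses.append(token_term + pixel_term)
    return torch.stack(losses).mean()

# During finetuning this term would be added to the usual denoising objective,
# e.g. loss = l_denoise + lambda_token * token_grounding_loss(attn_maps, seg_masks)
```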
Related papers
- Composable Visual Tokenizers with Generator-Free Diagnostics of Learnability [30.139325285692568]
We introduce CompTok, a training framework for learning visual tokenizers whose tokens are enhanced for compositionality. By employing an InfoGAN-style objective, we train a recognition model to predict the tokens used to condition a diffusion decoder. Experiments show that CompTok improves on both metrics while supporting state-of-the-art generators for class-conditioned generation.
arXiv Detail & Related papers (2026-02-03T10:02:51Z) - VSC: Visual Search Compositional Text-to-Image Diffusion Model [15.682990658945682]
We introduce a novel compositional generation method that leverages pairwise image embeddings to improve attribute-object binding. Our approach decomposes complex prompts into sub-prompts, generates corresponding images, and computes visual prototypes that are fused with the text embeddings to enhance the representation. It outperforms existing compositional text-to-image diffusion models on the T2I-CompBench benchmark, achieving better human-evaluated image quality and greater robustness as the number of binding pairs in the prompt grows.
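A minimal sketch of the prototype-fusion step described above, assuming the prototypes are computed elsewhere by embedding the sub-prompt images; the names, the mixing weight alpha, and the rescaling rule are illustrative assumptions, not the paper's code.

```python
import torch

def fuse_visual_prototypes(text_emb, prototypes, alpha=0.3):
    """Illustrative fusion of visual prototypes into prompt token embeddings.

    text_emb:   (L, D) token embeddings of the full prompt.
    prototypes: dict {token_index: (D,) prototype} obtained by embedding images
                generated from the corresponding sub-prompt (computed elsewhere).
    """
    fused = text_emb.clone()
    for idx, proto in prototypes.items():
        proto = proto / (proto.norm() + 1e-8)                    # unit-normalize
        proto = proto * fused[idx].norm()                        # match the token's scale
        fused[idx] = (1 - alpha) * fused[idx] + alpha * proto    # convex combination
    return fused
```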
arXiv Detail & Related papers (2025-05-02T08:31:43Z) - TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space [36.9351027463136]
TokenVerse is a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model.
Our framework can disentangle complex visual elements and attributes from as little as a single image.
Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation.
arXiv Detail & Related papers (2025-01-21T15:49:29Z) - Improving Compositional Attribute Binding in Text-to-Image Generative Models via Enhanced Text Embeddings [46.723653095494896]
We investigate compositional attribute binding failures in text-to-image generative models.
We show that imperfect text conditioning with the CLIP text encoder is one of the primary reasons behind the inability of these models to generate high-fidelity compositional scenes.
Our main finding is that significant compositional improvements can be achieved without harming the model's FID score.
arXiv Detail & Related papers (2024-06-12T03:21:34Z) - Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models [1.6450779686641077]
We introduce Open-Vocabulary Attention Maps (OVAM), a training-free method for text-to-image diffusion models.
We evaluate these tokens within existing state-of-the-art Stable Diffusion extensions.
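One plausible way to turn collected cross-attention into a segmentation map, in the spirit of OVAM but not its actual implementation; the aggregation, upsampling, and threshold here are assumptions.

```python
import torch
import torch.nn.functional as F

def aggregate_attention(attn_list, token_idx, out_size=(512, 512), thresh=0.5):
    """Sketch of building a segmentation mask from cross-attention maps.

    attn_list: list of (heads, H*W, L) cross-attention tensors collected from
               several U-Net layers / timesteps during generation.
    token_idx: index of the text token whose map we want.
    """
    maps = []
    for attn in attn_list:
        heads, hw, _ = attn.shape
        side = int(hw ** 0.5)
        m = attn[:, :, token_idx].mean(0).reshape(1, 1, side, side)
        maps.append(F.interpolate(m, size=out_size, mode="bilinear", align_corners=False))
    heat = torch.cat(maps).mean(0)[0]                      # average over layers / timesteps
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return heat > thresh                                   # binary segmentation mask
```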
arXiv Detail & Related papers (2024-03-21T10:56:12Z) - MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor behind the text-image mismatch issue is inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
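A hedged sketch of the adaptive-mask idea, re-weighting each text token's contribution inside a cross-attention layer; this is an approximation for illustration, not the MaskDiffusion code, and the temperature-based sharpening rule is an assumption.

```python
import torch

def masked_cross_attention(q, k, v, temperature=0.1):
    """Sketch of re-weighting text-token contributions inside cross-attention.

    q: (HW, D) image queries; k, v: (L, D) text keys / values.
    """
    attn = torch.softmax(q @ k.t() / (q.shape[-1] ** 0.5), dim=-1)   # (HW, L)
    # Derive a soft mask from the attention itself: tokens that already dominate
    # a location keep their weight, weak ones are damped (sharpened by temperature).
    mask = torch.softmax(attn / temperature, dim=-1)
    reweighted = attn * mask
    reweighted = reweighted / (reweighted.sum(dim=-1, keepdim=True) + 1e-8)
    return reweighted @ v                                            # (HW, D)
```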
arXiv Detail & Related papers (2023-09-08T15:53:37Z) - Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
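A rough sketch of what attention-to-segment guidance might look like at sampling time; attn_fn, the loss, and the update rule are assumptions rather than the ZestGuide implementation, and attn_fn is assumed to be differentiable with respect to the latent.

```python
import torch

def segment_alignment_guidance(latent, attn_fn, seg_masks, token_indices, scale=1.0):
    """Sketch of steering sampling with an attention-to-segment alignment loss.

    attn_fn(latent) -> (L, H, W) cross-attention maps at the current step
                       (assumed differentiable w.r.t. the latent).
    seg_masks: (N, H, W) float layout masks; token_indices: their text tokens.
    """
    latent = latent.detach().requires_grad_(True)
    attn = attn_fn(latent)
    loss = latent.new_zeros(())
    for idx, mask in zip(token_indices, seg_masks):
        a = attn[idx] / (attn[idx].sum() + 1e-8)
        loss = loss + (1.0 - (a * mask).sum())        # attention mass outside the segment
    grad = torch.autograd.grad(loss, latent)[0]
    return (latent - scale * grad).detach()           # nudge the latent toward the layout
```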
arXiv Detail & Related papers (2023-06-23T19:24:48Z) - Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing
with Pre-Trained Diffusion Model [22.975965453227477]
We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD).
In our experiments, we apply PhD to subject-driven image editing tasks and also explore text-driven scene generation given a reference subject.
arXiv Detail & Related papers (2023-06-13T07:43:10Z) - Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than those of standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
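As an illustration of classifier-driven token fine-tuning, the sketch below optimizes a single token embedding against a frozen classifier; generate_image and classifier are hypothetical, differentiable stand-ins, not the paper's code.

```python
import torch
import torch.nn.functional as F

def optimize_class_token(token_emb, generate_image, classifier, target_class,
                         steps=100, lr=1e-3):
    """Sketch of optimizing one extra token embedding against a frozen classifier.

    generate_image(token_emb) -> image tensor, assumed differentiable w.r.t. the token.
    classifier(image)         -> (1, num_classes) logits.
    """
    token_emb = token_emb.clone().requires_grad_(True)
    opt = torch.optim.Adam([token_emb], lr=lr)
    target = torch.tensor([target_class])
    for _ in range(steps):
        logits = classifier(generate_image(token_emb))
        loss = F.cross_entropy(logits, target.to(logits.device))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return token_emb.detach()
```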
arXiv Detail & Related papers (2023-03-30T05:25:20Z) - Unified Multi-Modal Latent Diffusion for Joint Subject and Text
Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z) - eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert
Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
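A minimal sketch of routing timesteps to stage-specific expert denoisers; the uniform interval split here is an assumption, not eDiffi's actual noise-level schedule.

```python
def select_expert(t, t_max, experts):
    """Sketch of routing a denoising step to a stage-specific expert.

    t:       current timestep (t_max = most noise, 0 = least noise).
    experts: denoisers ordered from the high-noise stage to the low-noise stage.
    """
    n = len(experts)
    idx = min(int((1 - t / t_max) * n), n - 1)   # map the timestep to an expert index
    return experts[idx]

# In the sampling loop:  eps = select_expert(t, T, experts)(x_t, t, text_emb)
```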
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.