P+: Extended Textual Conditioning in Text-to-Image Generation
- URL: http://arxiv.org/abs/2303.09522v3
- Date: Sat, 15 Jul 2023 15:48:48 GMT
- Title: P+: Extended Textual Conditioning in Text-to-Image Generation
- Authors: Andrey Voynov, Qinghao Chu, Daniel Cohen-Or, Kfir Aberman
- Abstract summary: We introduce an Extended Textual Conditioning space in text-to-image models, referred to as $P+$.
We show that the extended space provides greater disentangling and control over image synthesis.
We further introduce Extended Textual Inversion (XTI), where the images are inverted into $P+$, and represented by per-layer tokens.
- Score: 50.823884280133626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce an Extended Textual Conditioning space in text-to-image models,
referred to as $P+$. This space consists of multiple textual conditions,
derived from per-layer prompts, each corresponding to a layer of the denoising
U-net of the diffusion model.
We show that the extended space provides greater disentangling and control
over image synthesis. We further introduce Extended Textual Inversion (XTI),
where the images are inverted into $P+$, and represented by per-layer tokens.
We show that XTI is more expressive and precise, and converges faster than
the original Textual Inversion (TI) space. The extended inversion method does
not involve any noticeable trade-off between reconstruction and editability and
induces more regular inversions.
We conduct a series of extensive experiments to analyze and understand the
properties of the new space, and to showcase the effectiveness of our method
for personalizing text-to-image models. Furthermore, we utilize the unique
properties of this space to achieve previously unattainable results in
object-style mixing using text-to-image models. Project page:
https://prompt-plus.github.io
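To make the per-layer conditioning concrete, the following minimal Python sketch (not the authors' code) contrasts classic single-prompt conditioning with the P+ setup, in which each cross-attention layer of the denoising U-Net receives its own prompt embedding. The layer names, the encode_prompt stand-in, and the build_p_plus_conditioning helper are hypothetical placeholders for illustration only.

```python
# Minimal sketch (not the authors' implementation) of the P+ idea: instead of one
# prompt embedding shared by every cross-attention layer of the denoising U-Net,
# each layer gets its own prompt. All names below are illustrative assumptions.
from typing import Dict, List

import torch

# Assumed cross-attention layer names, for illustration only.
CROSS_ATTN_LAYERS: List[str] = [
    "down_0", "down_1", "down_2", "mid", "up_0", "up_1", "up_2",
]

def encode_prompt(prompt: str) -> torch.Tensor:
    """Stand-in for a frozen text encoder: returns (seq_len, dim) embeddings."""
    torch.manual_seed(abs(hash(prompt)) % (2**31))  # deterministic dummy output
    return torch.randn(77, 768)

def build_p_plus_conditioning(per_layer_prompts: Dict[str, str]) -> Dict[str, torch.Tensor]:
    """Map each U-Net cross-attention layer to its own text embedding (the P+ space)."""
    return {layer: encode_prompt(per_layer_prompts[layer]) for layer in CROSS_ATTN_LAYERS}

if __name__ == "__main__":
    # Classic conditioning: the same prompt for every layer.
    p_space = build_p_plus_conditioning(
        {layer: "a photo of a teapot" for layer in CROSS_ATTN_LAYERS}
    )

    # P+ conditioning: e.g. coarse layers describe shape, fine layers describe style.
    p_plus = build_p_plus_conditioning({
        layer: ("a photo of a teapot" if layer.startswith(("down", "mid"))
                else "a photo of a teapot, in watercolor style")
        for layer in CROSS_ATTN_LAYERS
    })

    # A P+-aware U-Net would look up conditioning[layer_name] inside each
    # cross-attention block instead of reusing a single shared embedding.
    print({layer: tuple(emb.shape) for layer, emb in p_plus.items()})
```

XTI follows the same structure: instead of optimizing one token embedding shared by all layers, it optimizes a separate token embedding for each entry of such a per-layer mapping.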
Related papers
- LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation [30.897935761304034]
We propose a novel framework called LLM4GEN, which enhances the semantic understanding of text-to-image diffusion models.
A specially designed Cross-Adapter Module (CAM) integrates the original text features of text-to-image models with LLM features.
DensePrompts, which contains 7,000 dense prompts, provides a comprehensive evaluation for the text-to-image generation task.
arXiv Detail & Related papers (2024-06-30T15:50:32Z)
- Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training [33.51524424536508]
Iterative Prompt Relabeling (IPR) is a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling with feedback.
We conduct thorough experiments on SDv2 and SDXL, testing their capability to follow instructions on spatial relations.
arXiv Detail & Related papers (2023-12-23T11:10:43Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided Image Editing [22.354236929932476]
Text-guided image editing faces significant challenges in training and inference flexibility.
We propose a novel framework called DeltaEdit, which maps the CLIP visual feature differences to the latent space directions of a generative model.
Experiments validate the effectiveness and versatility of DeltaEdit with different generative models.
arXiv Detail & Related papers (2023-10-12T15:43:12Z)
- ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models [77.03361270726944]
Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models.
We propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low to high frequency information.
We apply ProSpect in various personalized attribute-aware image generation applications, such as image-guided or text-driven manipulations of materials, style, and layout.
arXiv Detail & Related papers (2023-05-25T16:32:01Z)
- STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
STAIR significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvements.
arXiv Detail & Related papers (2023-01-30T17:21:30Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z)