Translation of Text Embedding via Delta Vector to Suppress Strongly Entangled Content in Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2508.10407v2
- Date: Mon, 18 Aug 2025 06:07:34 GMT
- Title: Translation of Text Embedding via Delta Vector to Suppress Strongly Entangled Content in Text-to-Image Diffusion Models
- Authors: Eunseo Koh, Seunghoo Hong, Tae-Young Kim, Simon S. Woo, Jae-Pil Heo,
- Abstract summary: We propose a novel approach to suppress content that is strongly entangled with specific words.<n>Our method modifies the text embedding to weaken the influence of undesired content in the generated image.<n>Our approach significantly outperforms existing methods in terms of quantitative and qualitative metrics.
- Score: 29.12099884112892
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-Image (T2I) diffusion models have made significant progress in generating diverse high-quality images from textual prompts. However, these models still face challenges in suppressing content that is strongly entangled with specific words. For example, when generating an image of "Charlie Chaplin", a "mustache" consistently appears even if explicitly instructed not to include it, as the concept of "mustache" is strongly entangled with "Charlie Chaplin". To address this issue, we propose a novel approach to directly suppress such entangled content within the text embedding space of diffusion models. Our method introduces a delta vector that modifies the text embedding to weaken the influence of undesired content in the generated image, and we further demonstrate that this delta vector can be easily obtained through a zero-shot approach. Furthermore, we propose a Selective Suppression with Delta Vector (SSDV) method to adapt delta vector into the cross-attention mechanism, enabling more effective suppression of unwanted content in regions where it would otherwise be generated. Additionally, we enabled more precise suppression in personalized T2I models by optimizing delta vector, which previous baselines were unable to achieve. Extensive experimental results demonstrate that our approach significantly outperforms existing methods, both in terms of quantitative and qualitative metrics.
Related papers
- Contrastive Learning Guided Latent Diffusion Model for Image-to-Image Translation [7.218556478126324]
diffusion model has demonstrated superior performance in diverse and high-quality images for text-guided image translation.<n>We propose pix2pix-zeroCon, a zero-shot diffusion-based method that eliminates the need for additional training by leveraging patch-wise contrastive loss.<n>Our approach requires no additional training and operates directly on a pre-trained text-to-image diffusion model.
arXiv Detail & Related papers (2025-03-26T12:15:25Z) - Aligning Text to Image in Diffusion Models is Easier Than You Think [47.623236425067326]
SoftREPA is a lightweight contrastive fine-tuning strategy that leverages soft text tokens for representation alignment.<n>Our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency.
arXiv Detail & Related papers (2025-03-11T10:14:22Z) - Continuous Concepts Removal in Text-to-image Diffusion Models [27.262721132177845]
Concerns have been raised about the potential for text-to-image models to create content that infringes on copyrights or depicts disturbing subject matter.<n>We propose a novel approach called CCRT that includes a designed knowledge distillation paradigm.<n>It constrains the text-image alignment behavior during the continuous concept removal process by using a set of text prompts.
arXiv Detail & Related papers (2024-11-30T20:40:10Z) - Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps [24.372192691537897]
This work aims to enrich distilled text-to-image diffusion models with the ability to effectively encode real images into their latent space.<n>We introduce invertible Consistency Distillation (iCD), a generalized consistency distillation framework that facilitates both high-quality image synthesis and accurate image encoding in only 3-4 inference steps.<n>We demonstrate that iCD equipped with dynamic guidance may serve as a highly effective tool for zero-shot text-guided image editing, competing with more expensive state-of-the-art alternatives.
arXiv Detail & Related papers (2024-06-20T17:49:11Z) - Get What You Want, Not What You Don't: Image Content Suppression for
Text-to-Image Diffusion Models [86.92711729969488]
We analyze how to manipulate the text embeddings and remove unwanted content from them.
The first regularizes the text embedding matrix and effectively suppresses the undesired content.
The second method aims to further suppress the unwanted content generation of the prompt, and encourages the generation of desired content.
arXiv Detail & Related papers (2024-02-08T03:15:06Z) - MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z) - Improving Diffusion-based Image Translation using Asymmetric Gradient
Guidance [51.188396199083336]
We present an approach that guides the reverse process of diffusion sampling by applying asymmetric gradient guidance.
Our model's adaptability allows it to be implemented with both image-fusion and latent-dif models.
Experiments show that our method outperforms various state-of-the-art models in image translation tasks.
arXiv Detail & Related papers (2023-06-07T12:56:56Z) - High-Fidelity Guided Image Synthesis with Latent Diffusion Models [50.39294302741698]
The proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores.
Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores.
arXiv Detail & Related papers (2022-11-30T15:43:20Z) - eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert
Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z) - Diffusion-based Image Translation using Disentangled Style and Content
Representation [51.188396199083336]
Diffusion-based image translation guided by semantic texts or a single target image has enabled flexible style transfer.
It is often difficult to maintain the original content of the image during the reverse diffusion.
We present a novel diffusion-based unsupervised image translation method using disentangled style and content representation.
Our experimental results show that the proposed method outperforms state-of-the-art baseline models in both text-guided and image-guided translation tasks.
arXiv Detail & Related papers (2022-09-30T06:44:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.