LLM-guided Instance-level Image Manipulation with Diffusion U-Net Cross-Attention Maps
- URL: http://arxiv.org/abs/2501.14046v1
- Date: Thu, 23 Jan 2025 19:26:14 GMT
- Title: LLM-guided Instance-level Image Manipulation with Diffusion U-Net Cross-Attention Maps
- Authors: Andrey Palaev, Adil Khan, Syed M. Ahsan Kazmi
- Abstract summary: We propose a pipeline leveraging Large Language Models, open-vocabulary detectors, cross-attention maps and diffusion U-Net for instance-level image manipulation.
Our method detects objects mentioned in the prompt and present in the generated image, enabling precise manipulation without extensive training or input masks.
- Abstract: The advancement of text-to-image synthesis has introduced powerful generative models capable of creating realistic images from textual prompts. However, precise control over image attributes remains challenging, especially at the instance level. While existing methods offer some control through fine-tuning or auxiliary information, they often face limitations in flexibility and accuracy. To address these challenges, we propose a pipeline leveraging Large Language Models (LLMs), open-vocabulary detectors, cross-attention maps and intermediate activations of diffusion U-Net for instance-level image manipulation. Our method detects objects mentioned in the prompt and present in the generated image, enabling precise manipulation without extensive training or input masks. By incorporating cross-attention maps, our approach ensures coherence in manipulated images while controlling object positions. Our method enables precise manipulations at the instance level without fine-tuning or auxiliary information such as masks or bounding boxes. Code is available at https://github.com/Palandr123/DiffusionU-NetLLM
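The abstract describes localizing an object via the diffusion U-Net's cross-attention maps, where each text token has an attention distribution over spatial locations. A minimal sketch of that mechanism is below; it is not the authors' released code, and the function name, array shapes, and threshold are illustrative assumptions (real U-Net layers produce attention at several resolutions):

```python
import numpy as np

def token_attention_mask(attn, token_idx, spatial_size, threshold=0.5):
    """Aggregate multi-head cross-attention into a binary mask for one token.

    attn: (heads, H*W, num_tokens) cross-attention probabilities from a
    single U-Net layer (hypothetical shape; real layers vary in resolution).
    """
    # Average over attention heads, then take the column for the target token.
    avg = attn.mean(axis=0)[:, token_idx]           # (H*W,)
    amap = avg.reshape(spatial_size, spatial_size)  # back to the spatial grid
    # Normalize to [0, 1] and threshold into a binary instance mask.
    amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-8)
    return amap > threshold

# Toy example: 8 heads, a 16x16 latent grid, 77 text tokens.
rng = np.random.default_rng(0)
attn = rng.random((8, 256, 77)).astype(np.float32)
mask = token_attention_mask(attn, token_idx=5, spatial_size=16)
```

In the pipeline described above, a mask like this would be intersected with open-vocabulary detector output to pick out a specific instance rather than every region attending to the token.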
Related papers
- Generating Compositional Scenes via Text-to-image RGBA Instance Generation [82.63805151691024]
Text-to-image diffusion generative models can generate high quality images at the cost of tedious prompt engineering.
We propose a novel multi-stage generation paradigm that is designed for fine-grained control, flexibility and interactivity.
Our experiments show that our RGBA diffusion model is capable of generating diverse and high quality instances with precise control over object attributes.
arXiv Detail & Related papers (2024-11-16T23:44:14Z)
- Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.
We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts.
We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z)
- Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation [1.0152838128195465]
We propose a method for spatially controlling text-to-image generation without further training of diffusion models.
Our aim is to control the attention maps according to given semantic masks and text prompts.
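Controlling attention maps with a semantic mask, as this entry describes, can be sketched as restricting a token's attention to a user-given region and renormalizing. This is a simplified illustration, not the paper's implementation; the shapes and function name are assumptions:

```python
import numpy as np

def mask_attention(attn, token_idx, mask):
    """Restrict one text token's cross-attention to a given spatial region.

    attn: (H*W, num_tokens) attention probabilities for one layer;
    mask: flat (H*W,) boolean region where the token should appear.
    Hypothetical shapes for illustration.
    """
    out = attn.copy()
    # Zero the token's attention outside the mask, keep it inside.
    out[:, token_idx] *= mask.astype(out.dtype)
    # Renormalize rows so each spatial location still sums to 1.
    out /= out.sum(axis=1, keepdims=True) + 1e-8
    return out

# Toy example: 64 spatial locations, 10 tokens, mask covering the top half.
rng = np.random.default_rng(1)
attn = rng.random((64, 10))
attn /= attn.sum(axis=1, keepdims=True)
mask = np.zeros(64, dtype=bool)
mask[:32] = True
edited = mask_attention(attn, token_idx=3, mask=mask)
```

The actual method guides the denoising process toward such masked attention rather than overwriting the maps directly, but the spatial constraint it enforces is the same.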
arXiv Detail & Related papers (2023-08-11T09:15:22Z)
- Diffusion Self-Guidance for Controllable Image Generation [106.59989386924136]
Self-guidance provides greater control over generated images by guiding the internal representations of diffusion models.
We show how a simple set of properties can be composed to perform challenging image manipulations.
We also show that self-guidance can be used to edit real images.
arXiv Detail & Related papers (2023-06-01T17:59:56Z)
- Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models [8.250234707160793]
Recent text-to-image (T2I) diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts.
However, they can fail to semantically align the generated images with the prompts due to their limited compositional capabilities.
We propose a novel attention mask control strategy based on predicted object boxes to address these issues.
arXiv Detail & Related papers (2023-05-23T10:49:22Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- LDEdit: Towards Generalized Text Guided Image Manipulation via Latent Diffusion Models [12.06277444740134]
Generic image manipulation using a single model with flexible text inputs is highly desirable.
Recent work addresses this task by guiding generative models trained on generic images with pretrained vision-language encoders.
We propose an optimization-free method for the task of generic image manipulation from text prompts.
arXiv Detail & Related papers (2022-10-05T13:26:15Z)
- StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [71.1862388442953]
We develop a text-based interface for StyleGAN image manipulation.
We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt.
Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation.
arXiv Detail & Related papers (2021-03-31T17:51:25Z)
- Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions [66.82547612097194]
We propose a novel algorithm, named Open-Edit, which is the first attempt on open-domain image manipulation with open-vocabulary instructions.
Our approach takes advantage of the unified visual-semantic embedding space pretrained on a general image-caption dataset.
We show promising results in manipulating open-vocabulary color, texture, and high-level attributes for various scenarios of open-domain images.
arXiv Detail & Related papers (2020-08-04T14:15:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.