Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control
- URL: http://arxiv.org/abs/2404.13766v1
- Date: Sun, 21 Apr 2024 20:26:46 GMT
- Title: Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control
- Authors: Maria Mihaela Trusca, Wolf Nuyts, Jonathan Thomm, Robert Honig, Thomas Hofmann, Tinne Tuytelaars, Marie-Francine Moens
- Abstract summary: Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
- Score: 58.37323932401379
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current diffusion models create photorealistic images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image. This is evidenced by our novel image-graph alignment model called EPViT (Edge Prediction Vision Transformer) for the evaluation of image-text alignment. To alleviate the above problem, we propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence. Additionally, the syntax structure of the prompt helps to disentangle the multimodal CLIP embeddings that are commonly used in T2I generation. The resulting DisCLIP embeddings and FCA are easily integrated in state-of-the-art diffusion models without additional training of these models. We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets. Code and data will be made available upon acceptance.
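As a rough illustration of the idea of constraining visual cross-attention with syntax, the minimal PyTorch sketch below focuses each attribute token's attention on the region where its head noun attends. The function name, the region heuristic, and the attribute-to-noun mapping are illustrative assumptions, not the paper's exact FCA formulation.

```python
import torch

def focused_cross_attention(q, k, v, attr_to_noun):
    """q: (n_pixels, d) image queries; k, v: (n_tokens, d) text keys/values.
    attr_to_noun maps an attribute token index to its head-noun token index,
    e.g. obtained from a dependency parse of the prompt."""
    scores = q @ k.T / (q.shape[-1] ** 0.5)            # (n_pixels, n_tokens)
    attn = torch.softmax(scores, dim=-1)

    # Focus each attribute token on the spatial region where its head noun
    # already attends, suppressing its attention elsewhere.
    for attr_idx, noun_idx in attr_to_noun.items():
        noun_map = attn[:, noun_idx]
        region = (noun_map > noun_map.mean()).float()   # crude spatial mask
        attn[:, attr_idx] = attn[:, attr_idx] * region

    attn = attn / attn.sum(dim=-1, keepdim=True)        # renormalize rows
    return attn @ v                                      # (n_pixels, d)

# Toy usage: 4 "pixels", 3 prompt tokens, e.g. "red"(0) modifying "car"(1).
q, k, v = torch.randn(4, 8), torch.randn(3, 8), torch.randn(3, 8)
out = focused_cross_attention(q, k, v, attr_to_noun={0: 1})
print(out.shape)  # torch.Size([4, 8])
```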
Related papers
- VisMin: Visual Minimal-Change Understanding [7.226130826257802]
We introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin).
VisMin requires models to predict the correct image-caption match given two images and two captions.
We generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks.
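A minimal sketch of how such a two-image, two-caption matching check could be scored on precomputed CLIP embeddings; the Winoground-style text/image/group criterion used here is an assumption, not necessarily VisMin's exact protocol.

```python
import torch
import torch.nn.functional as F

def vismin_scores(img_emb, txt_emb):
    """img_emb, txt_emb: (2, d) L2-normalized embeddings; row i of each
    tensor belongs to pair i. Returns (text_correct, image_correct, group_correct)."""
    sim = img_emb @ txt_emb.T                      # sim[i, j]: image i vs caption j
    text_correct = bool(sim[0, 0] > sim[0, 1]) and bool(sim[1, 1] > sim[1, 0])
    image_correct = bool(sim[0, 0] > sim[1, 0]) and bool(sim[1, 1] > sim[0, 1])
    return text_correct, image_correct, text_correct and image_correct

# Toy usage with random unit vectors standing in for CLIP embeddings.
e = F.normalize(torch.randn(4, 512), dim=-1)
print(vismin_scores(e[:2], e[2:]))
```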
arXiv Detail & Related papers (2024-07-23T18:10:43Z) - LocInv: Localization-aware Inversion for Text-Guided Image Editing [17.611103794346857]
Text-guided image editing research aims to empower users to manipulate generated images by altering the text prompts.
Existing image editing techniques are prone to editing over unintentional regions that are beyond the intended target area.
We propose localization-aware Inversion (LocInv), which exploits segmentation maps or bounding boxes as extra localization priors to refine the cross-attention maps.
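A rough sketch of how a bounding box could serve as a localization prior that gates a single token's cross-attention map; the gating rule and names are illustrative, not LocInv's implementation.

```python
import torch

def gate_attention_with_box(attn_map, box, keep_outside=0.1):
    """attn_map: (H, W) cross-attention for one text token; box: (x0, y0, x1, y1)
    in map coordinates. Attention outside the box is down-weighted."""
    H, W = attn_map.shape
    mask = torch.full((H, W), keep_outside)
    x0, y0, x1, y1 = box
    mask[y0:y1, x0:x1] = 1.0
    gated = attn_map * mask
    return gated / gated.sum().clamp_min(1e-8)     # keep it a distribution

# Toy usage: 16x16 attention map, target object assumed in the top-left quadrant.
attn = torch.softmax(torch.randn(16 * 16), dim=0).reshape(16, 16)
refined = gate_attention_with_box(attn, box=(0, 0, 8, 8))
print(float(refined.sum()))  # ~1.0
```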
arXiv Detail & Related papers (2024-05-02T17:27:04Z) - DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations [64.43387739794531]
Current encoder-based approaches significantly impair the text controllability of text-to-image models while transferring styles.
We introduce DEADiff to address this issue using the following two strategies.
DEADiff attains the best visual stylization results and an optimal balance between the text controllability inherent in the text-to-image model and style similarity to the reference image.
arXiv Detail & Related papers (2024-03-11T17:35:23Z) - Localizing and Editing Knowledge in Text-to-Image Generative Models [62.02776252311559]
Knowledge about different attributes is not localized in isolated components but is instead distributed among a set of components in the conditional UNet.
We introduce a fast, data-free model editing method Diff-QuickFix which can effectively edit concepts in text-to-image models.
arXiv Detail & Related papers (2023-10-20T17:31:12Z) - SingleInsert: Inserting New Concepts from a Single Image into Text-to-Image Models for Flexible Editing [59.3017821001455]
We propose SingleInsert, a simple and effective baseline for image-to-text (I2T) inversion from a single source image containing the target concept.
With the proposed techniques, SingleInsert excels in single concept generation with high visual fidelity while allowing flexible editing.
arXiv Detail & Related papers (2023-10-12T07:40:39Z) - Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing [23.00202969969574]
We propose Dynamic Prompt Learning (DPL) to force cross-attention maps to focus on correct noun words in the text prompt.
We show improved prompt editing results for Word-Swap, Prompt Refinement, and Attention Re-weighting, especially for complex multi-object scenes.
arXiv Detail & Related papers (2023-09-27T13:55:57Z) - Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models [8.250234707160793]
Recent text-to-image (T2I) diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts.
However, they often fail to semantically align the generated images with the prompts due to their limited compositional capabilities.
We propose a novel attention mask control strategy based on predicted object boxes to address these issues.
arXiv Detail & Related papers (2023-05-23T10:49:22Z) - iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity and CLIP alignment score, and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z) - Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text-to-image generation (T2I) model aims to generate photo-realistic images that are semantically consistent with the text descriptions.
We propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z) - Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically with recurrent neural networks (RNNs).
Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset.
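A toy sketch of symmetric sequence encoding for image-text matching, loosely in the spirit of DP-RNN: object features and word embeddings are each run through a GRU and the pair is scored by cosine similarity. The dimensions and the pooling choice are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SymmetricMatcher(nn.Module):
    """Encodes a sequence of image-object features and a sequence of word
    embeddings with two GRUs and scores the pair by cosine similarity."""
    def __init__(self, obj_dim=2048, word_dim=300, hidden=512):
        super().__init__()
        self.img_rnn = nn.GRU(obj_dim, hidden, batch_first=True)
        self.txt_rnn = nn.GRU(word_dim, hidden, batch_first=True)

    def forward(self, obj_feats, word_embs):
        # obj_feats: (B, n_objects, obj_dim); word_embs: (B, n_words, word_dim)
        _, h_img = self.img_rnn(obj_feats)          # h_img: (1, B, hidden)
        _, h_txt = self.txt_rnn(word_embs)
        return F.cosine_similarity(h_img[-1], h_txt[-1], dim=-1)  # (B,)

# Toy usage: batch of 2 pairs, 5 detected objects, 7 words.
matcher = SymmetricMatcher()
print(matcher(torch.randn(2, 5, 2048), torch.randn(2, 7, 300)).shape)  # torch.Size([2])
```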
arXiv Detail & Related papers (2020-02-20T00:51:01Z)