Localized Text-to-Image Generation for Free via Cross Attention Control
- URL: http://arxiv.org/abs/2306.14636v1
- Date: Mon, 26 Jun 2023 12:15:06 GMT
- Title: Localized Text-to-Image Generation for Free via Cross Attention Control
- Authors: Yutong He, Ruslan Salakhutdinov, J. Zico Kolter
- Abstract summary: We show that localized generation can be achieved by simply controlling cross attention maps during inference.
Our proposed cross attention control (CAC) provides new open-vocabulary localization abilities to standard text-to-image models.
- Score: 154.06530917754515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the tremendous success in text-to-image generative models, localized text-to-image generation (that is, generating objects or features at specific locations in an image while maintaining a consistent overall generation) still requires either explicit training or substantial additional inference time. In this work, we show that localized generation can be achieved by simply controlling cross attention maps during inference. With no additional training, model architecture modification or inference time, our proposed cross attention control (CAC) provides new open-vocabulary localization abilities to standard text-to-image models. CAC also enhances models that are already trained for localized generation when deployed at inference time. Furthermore, to assess localized text-to-image generation performance automatically, we develop a standardized suite of evaluations using large pretrained recognition models. Our experiments show that CAC improves localized generation performance with various types of location information ranging from bounding boxes to semantic segmentation maps, and enhances the compositional capability of state-of-the-art text-to-image generative models.
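The core idea of controlling cross attention maps during inference can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`masked_cross_attention`, `box_to_mask`), tensor shapes, and the specific bounding-box masking scheme below are assumptions chosen for clarity, and CAC's actual attention manipulation may differ in detail.

```python
# Illustrative sketch (assumed, not the paper's code): restrict which spatial
# locations selected text tokens may influence during cross attention.
import torch


def masked_cross_attention(q, k, v, token_region_mask):
    """Cross attention in which selected text tokens only influence a region.

    q: (batch, hw, d)      image queries at spatial resolution h*w
    k, v: (batch, t, d)    text-token keys and values
    token_region_mask: (batch, hw, t) bool; True where a token may attend to
        a spatial location (e.g. rasterized from a bounding box).
    """
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5  # (batch, hw, t)
    scores = scores.masked_fill(~token_region_mask, float("-inf"))
    return scores.softmax(dim=-1) @ v                      # (batch, hw, d)


def box_to_mask(box, h, w, t, localized_tokens):
    """Rasterize a bounding box (x0, y0, x1, y1 in [0, 1]) into a token mask."""
    x0, y0, x1, y1 = box
    region = torch.zeros(h, w, dtype=torch.bool)
    region[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = True
    mask = torch.ones(h * w, t, dtype=torch.bool)
    # Tokens describing the localized object may only attend inside the box;
    # the remaining tokens stay unrestricted.
    mask[:, localized_tokens] = region.reshape(-1, 1)
    return mask


if __name__ == "__main__":
    h = w = 16
    t, d = 8, 64
    q, k, v = torch.randn(1, h * w, d), torch.randn(1, t, d), torch.randn(1, t, d)
    # Hypothetical example: tokens 2 and 3 describe an object in the upper-left quadrant.
    mask = box_to_mask((0.0, 0.0, 0.5, 0.5), h, w, t, localized_tokens=[2, 3])
    out = masked_cross_attention(q, k, v, mask.unsqueeze(0))
    print(out.shape)  # torch.Size([1, 256, 64])
```

Because the mask only reshapes existing attention scores, a scheme like this adds no trainable parameters and negligible compute, which is consistent with the abstract's claim of no additional training or inference time.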
Related papers
- Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning [40.06403155373455]
We propose a novel reinforcement learning framework for personalized text-to-image generation.
Our proposed approach outperforms existing state-of-the-art methods by a large margin in visual fidelity while maintaining text alignment.
arXiv Detail & Related papers (2024-07-09T08:11:53Z)
- Regularized Training with Generated Datasets for Name-Only Transfer of Vision-Language Models [36.59260354292177]
Recent advancements in text-to-image generation have inspired researchers to generate datasets tailored for perception models using generative models.
We aim to fine-tune vision-language models to a specific classification model without access to any real images.
Despite the high fidelity of generated images, we observed a significant performance degradation when fine-tuning the model using the generated datasets.
arXiv Detail & Related papers (2024-06-08T10:43:49Z)
- Active Generation for Image Classification [45.93535669217115]
We propose to address the efficiency of image generation by focusing on the specific needs and characteristics of the model.
With a central tenet of active learning, our method, named ActGen, takes a training-aware approach to image generation.
arXiv Detail & Related papers (2024-03-11T08:45:31Z)
- SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation [68.42476385214785]
We propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance.
SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works.
We also propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms.
arXiv Detail & Related papers (2023-08-20T04:09:12Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- GIT: A Generative Image-to-text Transformer for Vision and Language [138.91581326369837]
We train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr).
arXiv Detail & Related papers (2022-05-27T17:03:38Z)
- CAGAN: Text-To-Image Generation with Combined Attention GANs [70.3497683558609]
We propose the Combined Attention Generative Adversarial Network (CAGAN) to generate photo-realistic images according to textual descriptions.
The proposed CAGAN uses two attention models: word attention to draw different sub-regions conditioned on related words, and squeeze-and-excitation attention to capture non-linear interactions among channels (see the sketch after this list).
With spectral normalisation to stabilise training, our proposed CAGAN improves the state of the art on the IS and FID on the CUB dataset and the FID on the more challenging COCO dataset.
arXiv Detail & Related papers (2021-04-26T15:46:40Z)
- Local Class-Specific and Global Image-Level Generative Adversarial Networks for Semantic-Guided Scene Generation [135.4660201856059]
We consider learning the scene generation in a local context, and design a local class-specific generative network with semantic maps as a guidance.
To learn more discriminative class-specific feature representations for local generation, a novel classification module is also proposed.
Experiments on two scene image generation tasks show superior generation performance of the proposed model.
arXiv Detail & Related papers (2019-12-27T16:14:53Z)
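Squeeze-and-excitation attention, named in the CAGAN summary above, is a standard channel-attention block. The sketch below is a generic illustration of that mechanism, not code from the paper; the channel count and reduction ratio are arbitrary assumptions.

```python
# Generic squeeze-and-excitation (SE) block: gate each channel by a learned
# function of its globally pooled response.
import torch
import torch.nn as nn


class SqueezeExcitation(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # squeeze to a bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # excite back to all channels
            nn.Sigmoid(),                                # per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        gates = self.fc(x.mean(dim=(2, 3)))              # global average pool -> gates
        return x * gates[:, :, None, None]               # rescale each channel map


if __name__ == "__main__":
    block = SqueezeExcitation(channels=64)
    feats = torch.randn(2, 64, 32, 32)
    print(block(feats).shape)  # torch.Size([2, 64, 32, 32])
```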