Grounded Text-to-Image Synthesis with Attention Refocusing
- URL: http://arxiv.org/abs/2306.05427v2
- Date: Sat, 2 Dec 2023 02:02:54 GMT
- Title: Grounded Text-to-Image Synthesis with Attention Refocusing
- Authors: Quynh Phung, Songwei Ge, Jia-Bin Huang
- Abstract summary: We reveal potential causes of prompt-following failures in the diffusion model's cross-attention and self-attention layers.
We propose two novel losses to refocus attention maps according to a given spatial layout during sampling.
We show that our proposed attention refocusing effectively improves the controllability of existing approaches.
- Score: 16.9170825951175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Driven by the scalable diffusion models trained on large-scale datasets,
text-to-image synthesis methods have shown compelling results. However, these
models still fail to precisely follow text prompts involving multiple
objects, attributes, or spatial compositions. In this paper, we reveal the
potential causes in the diffusion model's cross-attention and self-attention
layers. We propose two novel losses to refocus attention maps according to a
given spatial layout during sampling. Creating the layouts manually requires
additional effort and can be tedious. Therefore, we explore using large
language models (LLMs) to produce these layouts for our method. We conduct
extensive experiments on the DrawBench, HRS, and TIFA benchmarks to evaluate
our proposed method. We show that our proposed attention refocusing effectively
improves the controllability of existing approaches.
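The abstract does not spell out the two losses, but the core idea of refocusing cross-attention onto a spatial layout can be sketched as follows. This is a minimal, hypothetical PyTorch sketch, not the paper's exact formulation: `attn_maps`, `layout_masks`, `get_cross_attention_maps`, and `step_size` are illustrative names, and the paper's second loss on self-attention is omitted here.

```python
import torch

def attention_refocus_loss(attn_maps, layout_masks, eps=1e-8):
    """Encourage each grounded phrase's cross-attention mass to fall inside its box.

    attn_maps:    (num_phrases, H, W) cross-attention maps for the grounded phrases
    layout_masks: (num_phrases, H, W) binary masks, 1 inside each phrase's box
    """
    # Normalize each phrase's attention map so its mass sums to 1.
    attn = attn_maps / (attn_maps.sum(dim=(1, 2), keepdim=True) + eps)
    inside = (attn * layout_masks).sum(dim=(1, 2))  # attention mass inside each box
    # Penalize mass that leaks outside the box (equivalently, mass missing inside).
    return ((1.0 - inside) ** 2).mean()

# Hypothetical use inside the sampling loop: take a gradient step on the noisy
# latent z_t before each denoising step (names are illustrative, not a real API).
# z_t = z_t.detach().requires_grad_(True)
# attn_maps = get_cross_attention_maps(unet, z_t, prompt_embeds)
# loss = attention_refocus_loss(attn_maps, layout_masks)
# z_t = z_t - step_size * torch.autograd.grad(loss, z_t)[0]
```

In practice the layout masks can come from user-drawn boxes or, as the paper proposes, from an LLM that converts the prompt into bounding boxes.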
Related papers
- Coherent and Multi-modality Image Inpainting via Latent Space Optimization [61.99406669027195]
PILOT (inPainting vIa Latent OpTimization) is an optimization approach grounded on a novel semantic centralization and background preservation loss.
Our method searches latent spaces capable of generating inpainted regions that exhibit high fidelity to user-provided prompts while maintaining coherence with the background.
arXiv Detail & Related papers (2024-07-10T19:58:04Z)
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR).
We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z)
- Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models [23.786473791344395]
Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experiment results show that our method consistently outperforms other baselines.
arXiv Detail & Related papers (2024-03-11T02:18:27Z)
- Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive [21.49096276631859]
Current L2I models suffer from either poor editability via text or weak alignment between the generated image and the input layout.
We propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM).
Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout.
arXiv Detail & Related papers (2024-01-16T20:31:46Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis [60.260724486834164]
This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries.
We present two key innovations: Vision Guidance and the Layered Rendering Diffusion framework.
We apply our method to three practical applications: bounding box-to-image, semantic mask-to-image, and image editing.
arXiv Detail & Related papers (2023-11-30T10:36:19Z)
- R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation [74.5598315066249]
We investigate zero-shot grounded T2I generation with diffusion models.
We propose a Region and Boundary (R&B) aware cross-attention guidance approach.
arXiv Detail & Related papers (2023-10-13T05:48:42Z)
- Light Field Saliency Detection with Dual Local Graph Learning and Reciprocative Guidance [148.9832328803202]
We model the information fusion within the focal stack via graph networks.
We build a novel dual graph model to guide the focal stack fusion process using all-focus patterns.
arXiv Detail & Related papers (2021-10-02T00:54:39Z)