R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation
- URL: http://arxiv.org/abs/2310.08872v5
- Date: Mon, 27 Nov 2023 08:42:07 GMT
- Title: R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation
- Authors: Jiayu Xiao, Henglei Lv, Liang Li, Shuhui Wang, Qingming Huang
- Abstract summary: We probe into zero-shot grounded T2I generation with diffusion models.
We propose a Region and Boundary (R&B) aware cross-attention guidance approach.
- Score: 74.5598315066249
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent text-to-image (T2I) diffusion models have achieved remarkable progress
in generating high-quality images given text prompts as input. However, these
models fail to convey appropriate spatial composition specified by a layout
instruction. In this work, we probe into zero-shot grounded T2I generation with
diffusion models, that is, generating images corresponding to the input layout
information without training auxiliary modules or finetuning diffusion models.
We propose a Region and Boundary (R&B) aware cross-attention guidance approach
that gradually modulates the attention maps of the diffusion model during the
generative process, helping the model synthesize images that (1) have high
fidelity, (2) are highly compatible with the textual input, and (3) interpret
layout instructions accurately. Specifically, we leverage discrete sampling to
bridge the gap between continuous attention maps and discrete layout
constraints, and design a region-aware loss to refine the generative layout
during the diffusion process. We further propose a boundary-aware loss to
strengthen object discriminability within the corresponding regions.
Experimental results show that our method outperforms existing state-of-the-art
zero-shot grounded T2I generation methods by a large margin both qualitatively
and quantitatively on several benchmarks.
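To make the guidance idea concrete, here is a minimal, self-contained PyTorch sketch of region-aware cross-attention guidance in the spirit of the abstract (not the authors' released code): each grounded token's cross-attention map is compared against a rasterized bounding-box mask, and the gradient of the resulting loss nudges the latent at a sampling step. The helper names (`box_mask`, `region_aware_loss`), tensor shapes, and step size are illustrative assumptions.

```python
# Hedged sketch of region-aware cross-attention guidance (not the R&B implementation).
import torch

def box_mask(box, size=16):
    """Rasterize a normalized (x0, y0, x1, y1) box into a binary (size, size) mask."""
    m = torch.zeros(size, size)
    x0, y0, x1, y1 = (int(round(v * size)) for v in box)
    m[y0:y1, x0:x1] = 1.0
    return m

def region_aware_loss(attn_maps, boxes):
    """attn_maps: {token_idx: (H, W) non-negative cross-attention map}.
    boxes: {token_idx: normalized (x0, y0, x1, y1)}.
    Small when each token's attention mass lies inside its box."""
    loss = 0.0
    for t, attn in attn_maps.items():
        mask = box_mask(boxes[t], attn.shape[-1]).to(attn)
        attn = attn / (attn.sum() + 1e-8)       # treat the map as a distribution
        inside = (attn * mask).sum()            # attention mass inside the box
        loss = loss + (1.0 - inside)            # penalize mass leaking outside
    return loss

# Toy usage: one latent, one grounded token, one guidance step on the latent.
latent = torch.randn(1, 4, 64, 64, requires_grad=True)
# Stand-in for a cross-attention map that depends on the latent; a real sampler
# would obtain this from the UNet's cross-attention layers.
attn = torch.rand(16, 16) * latent.abs().mean()
loss = region_aware_loss({0: attn}, {0: (0.1, 0.1, 0.5, 0.5)})
loss.backward()
latent = (latent - 0.1 * latent.grad).detach()  # nudge the latent toward the layout
```

The boundary-aware loss described in the abstract would contribute an additional term defined on the region boundaries; it is omitted here to keep the sketch short.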
Related papers
- Boundary Attention Constrained Zero-Shot Layout-To-Image Generation [47.435234391588494]
Recent text-to-image diffusion models excel at generating high-resolution images from text but struggle with precise control over spatial composition and object counting.
We propose a novel zero-shot L2I approach, BACON, which eliminates the need for additional modules or fine-tuning.
We leverage pixel-to-pixel correlations in the self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features.
arXiv Detail & Related papers (2024-11-15T05:44:45Z)
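As a rough illustration of the self-attention-based alignment that the BACON summary above describes (a sketch under assumed names and shapes, not the paper's code), row-stochastic pixel-to-pixel self-attention can be used as an affinity matrix to propagate and smooth per-token cross-attention maps:

```python
# Hedged sketch: refining cross-attention maps with self-attention affinities.
import torch

def refine_cross_attention(self_attn, cross_attn, iters=1):
    """self_attn: (N, N) row-stochastic pixel-to-pixel affinities.
    cross_attn: (N, T) per-pixel, per-token cross-attention scores.
    Each iteration re-estimates a pixel's token scores from the pixels it attends to."""
    refined = cross_attn
    for _ in range(iters):
        refined = self_attn @ refined
    return refined

# Toy usage: a 16x16 latent grid (N = 256) and 4 text tokens.
N, T = 256, 4
self_attn = torch.softmax(torch.randn(N, N), dim=-1)
cross_attn = torch.rand(N, T)
aligned = refine_cross_attention(self_attn, cross_attn, iters=2)
```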
- Coherent and Multi-modality Image Inpainting via Latent Space Optimization [61.99406669027195]
PILOT (inPainting via Latent OpTimization) is an optimization approach grounded on a novel semantic centralization and background preservation loss.
Our method searches latent spaces capable of generating inpainted regions that exhibit high fidelity to user-provided prompts while maintaining coherence with the background.
arXiv Detail & Related papers (2024-07-10T19:58:04Z)
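The "background preservation" idea in the PILOT summary can be illustrated with a small hedged sketch (names and shapes are assumptions, not the paper's published objective): outside the inpainting mask, the optimized latent is pulled toward the latent of the original image.

```python
# Hedged sketch of a background-preservation term for latent-space inpainting.
import torch

def background_preservation_loss(latent, original_latent, inpaint_mask):
    """latent, original_latent: (1, C, H, W) latents; inpaint_mask: (1, 1, H, W) with
    1 inside the region to repaint. Penalizes drift from the original outside that region."""
    keep = 1.0 - inpaint_mask
    return ((latent - original_latent) ** 2 * keep).sum() / (keep.sum() + 1e-8)

# Toy usage: repaint the central 32x32 patch of a 64x64 latent grid.
latent = torch.randn(1, 4, 64, 64, requires_grad=True)
original = torch.randn(1, 4, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1.0
loss = background_preservation_loss(latent, original, mask)
loss.backward()  # gradient only flows outside the masked region
```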
- LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? [10.72249123249003]
We revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding.
We introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions.
LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr.
arXiv Detail & Related papers (2024-04-16T17:47:16Z)
- Text-Image Conditioned Diffusion for Consistent Text-to-3D Generation [28.079441901818296]
We propose a text-to-3D method for Neural Radiance Fields (NeRFs) that explicitly enforces fine-grained view consistency.
Our method achieves state-of-the-art performance over existing text-to-3D methods.
arXiv Detail & Related papers (2023-12-19T01:09:49Z)
- Dense Text-to-Image Generation with Attention Modulation [49.287458275920514]
Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions.
We propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions.
We achieve visual quality comparable to that of models specifically trained with layout conditions.
arXiv Detail & Related papers (2023-08-24T17:59:01Z)
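A hedged sketch of the attention-modulation idea behind the DenseDiffusion summary above (illustrative only; the biasing scheme, names, and shapes are assumptions): attention logits between image queries and text keys are shifted up inside each token's target segment and down outside it before the softmax.

```python
# Hedged sketch of layout-conditioned attention modulation.
import torch

def modulate_attention(logits, token_masks, strength=1.0):
    """logits: (N_pixels, T) query-key scores before softmax.
    token_masks: (N_pixels, T), 1 where a token should be expressed, 0 elsewhere.
    Returns softmax attention with a layout-dependent positive/negative bias."""
    bias = strength * (2.0 * token_masks - 1.0)   # +strength inside, -strength outside
    return torch.softmax(logits + bias, dim=-1)

# Toy usage: 64 pixels, 3 tokens, token 0 owns the first half of the pixels.
logits = torch.randn(64, 3)
masks = torch.zeros(64, 3)
masks[:32, 0] = 1.0
masks[32:, 1:] = 1.0
attn = modulate_attention(logits, masks, strength=2.0)
```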
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z)
- Grounded Text-to-Image Synthesis with Attention Refocusing [16.9170825951175]
We reveal potential causes of grounding failures in the diffusion model's cross-attention and self-attention layers.
We propose two novel losses to refocus attention maps according to a given spatial layout during sampling.
We show that our proposed attention refocusing effectively improves the controllability of existing approaches.
arXiv Detail & Related papers (2023-06-08T17:59:59Z)
- RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff.
In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt.
The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
arXiv Detail & Related papers (2023-05-31T06:59:21Z)
- SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis [38.22195812238951]
We propose a novel guidance approach for the sampling process in the diffusion model.
Our approach guides the model with semantic features from CLIP embeddings and enforces geometric constraints.
Our results demonstrate the effectiveness of incorporating bounding box and segmentation map guidance in the diffusion model sampling process.
arXiv Detail & Related papers (2023-04-28T00:14:28Z)