Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
- URL: http://arxiv.org/abs/2411.06558v2
- Date: Fri, 15 Nov 2024 14:38:32 GMT
- Title: Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
- Authors: Zhennan Chen, Yajie Li, Haofan Wang, Zhibo Chen, Zhengkai Jiang, Jun Li, Qian Wang, Jian Yang, Ying Tai
- Abstract summary: We present RAG, a Regional-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition.
RAG achieves superior performance on attribute binding and object relationships compared with previous tuning-free methods.
- Score: 40.94329069897935
- Abstract: Regional prompting, or compositional generation, which enables fine-grained spatial control, has gained increasing attention for its practicality in real-world applications. However, previous methods either introduce additional trainable modules and are thus only applicable to specific models, or manipulate score maps within cross-attention layers using attention masks, resulting in limited control strength when the number of regions increases. To handle these limitations, we present RAG, a Regional-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition. RAG decouples multi-region generation into two sub-tasks: the construction of individual regions (Regional Hard Binding), which ensures that each regional prompt is properly executed, and the overall detail refinement (Regional Soft Refinement) across regions, which dissolves visual boundaries and enhances interactions between adjacent regions. Furthermore, RAG makes repainting feasible: users can modify specific unsatisfactory regions of the previous generation while keeping all other regions unchanged, without relying on additional inpainting models. Our approach is tuning-free and applicable to other frameworks as an enhancement of their prompt-following ability. Quantitative and qualitative experiments demonstrate that RAG achieves superior performance on attribute binding and object relationships compared with previous tuning-free methods.
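The two sub-tasks can be pictured with a minimal per-step sketch. Everything below is an illustrative assumption rather than the authors' implementation: `denoise_step` stands in for a real denoiser call, and the fixed `refine_weight` blend ratio is invented for the example.

```python
import numpy as np

def denoise_step(latent, prompt_emb, t):
    """Stand-in for one real denoiser call (UNet/DiT); it exists only so
    the control flow below runs end to end."""
    return latent - 0.1 * (latent - prompt_emb.mean())

def rag_step(latent, region_prompts, region_masks, global_prompt, t,
             refine_weight=0.3):
    """One denoising step: hard binding, then soft refinement.
    refine_weight is an invented blend ratio, not a published value."""
    # Regional Hard Binding: denoise each region under its own prompt,
    # then stitch the per-region results together with the binary masks.
    bound = np.zeros_like(latent)
    for prompt_emb, mask in zip(region_prompts, region_masks):
        bound += mask[None] * denoise_step(latent, prompt_emb, t)

    # Regional Soft Refinement: one pass over the composed latent with the
    # global prompt softens region boundaries and restores interactions
    # between adjacent regions.
    refined = denoise_step(bound, global_prompt, t)
    return (1 - refine_weight) * bound + refine_weight * refined

# Toy usage: a 4x64x64 latent split into left/right regions.
latent = np.random.randn(4, 64, 64)
left, right = np.zeros((64, 64)), np.zeros((64, 64))
left[:, :32], right[:, 32:] = 1.0, 1.0
out = rag_step(latent, [np.random.randn(8), np.random.randn(8)],
               [left, right], np.random.randn(8), t=0)
```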
Related papers
- Differentiable Reasoning about Knowledge Graphs with Region-based Graph Neural Networks [62.93577376960498]
Methods for knowledge graph (KG) completion need to capture semantic regularities and use these regularities to infer plausible knowledge that is not explicitly stated.
Most embedding-based methods are opaque in the kinds of regularities they can capture, although region-based KG embedding models have emerged as a more transparent alternative.
We propose RESHUFFLE, a simple model based on ordering constraints that can faithfully capture a much larger class of rule bases than existing approaches.
arXiv Detail & Related papers (2024-06-13T18:37:24Z)
- RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection [20.630629383286262]
Open-vocabulary object detection requires solid modeling of the region-semantic relationship.
We propose RTGen to generate scalable open-vocabulary region-text pairs.
arXiv Detail & Related papers (2024-05-30T09:03:23Z)
- RegionGPT: Towards Region Understanding Vision Language Model [88.42271128373191]
RegionGPT (short as RGPT) is a novel framework designed for complex region-level captioning and understanding.
We develop an automated region caption data generation pipeline, enriching the training set with detailed region-level captions.
We demonstrate that a universal RGPT model can be effectively applied and significantly enhances performance across a range of region-level tasks.
arXiv Detail & Related papers (2024-03-04T18:58:08Z)
- Local Conditional Controlling for Text-to-Image Diffusion Models [26.54188248406709]
Diffusion models have exhibited impressive prowess in the text-to-image task.
Recent methods add image-level structure controls, e.g., edge and depth maps, to manipulate the generation process together with text prompts to obtain desired images.
This control operates globally on the entire image, which limits the flexibility of region-specific control. A region-restricted variant can be sketched as below.
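In the sketch, the control signal is simply gated by a user mask; the residual-style formulation and all names are assumptions for illustration, not this paper's method:

```python
import numpy as np

def local_control_step(latent, control_residual, region_mask, strength=1.0):
    """Apply a structure-control signal (e.g., a residual computed from an
    edge or depth map) only inside a user-chosen region, instead of
    globally over the whole image."""
    gate = strength * region_mask[None]   # 1 x H x W, broadcasts over channels
    return latent + gate * control_residual

# Toy usage: control only affects the top half of an 8x32x32 latent.
mask = np.zeros((32, 32)); mask[:16] = 1.0
out = local_control_step(np.random.randn(8, 32, 32),
                         np.random.randn(8, 32, 32), mask)
```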
arXiv Detail & Related papers (2023-12-14T09:31:33Z)
- SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation [68.42476385214785]
We propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance.
SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works.
We also propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms.
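How a layout might be turned into such a dense guidance map can be sketched as follows. Rasterizing boxes filled with text embeddings is an assumption made to keep the example short (SSMG targets free-form layouts), and the names are illustrative:

```python
import numpy as np

def build_spatial_semantic_map(boxes, text_embs, size=64):
    """Rasterize a layout into a dense guidance map: each box region is
    filled with the text embedding of its description. Free-form masks
    would work the same way; boxes keep the sketch short."""
    dim = text_embs[0].shape[0]
    ssm = np.zeros((dim, size, size))
    for (x0, y0, x1, y1), emb in zip(boxes, text_embs):
        ssm[:, y0:y1, x0:x1] = emb[:, None, None]
    return ssm  # fed to the diffusion model as spatial guidance

# Toy usage: two boxes with 16-d embeddings on a 64x64 canvas.
boxes = [(4, 4, 30, 60), (34, 10, 60, 50)]
embs = [np.random.randn(16), np.random.randn(16)]
guidance = build_spatial_semantic_map(boxes, embs)
print(guidance.shape)  # (16, 64, 64)
```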
arXiv Detail & Related papers (2023-08-20T04:09:12Z)
- Region-Aware Diffusion for Zero-shot Text-driven Image Editing [78.58917623854079]
We propose a novel region-aware diffusion model (RDM) for entity-level image editing.
To strike a balance between image fidelity and inference speed, we design an intensive diffusion pipeline.
The results show that RDM outperforms the previous approaches in terms of visual quality, overall harmonization, non-editing region content preservation, and text-image semantic consistency.
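The abstract does not spell out how non-editing regions are preserved; a common region-aware mechanism in diffusion editing, shown here as an assumed sketch rather than RDM's actual pipeline, is to re-noise the original image to the current timestep and blend it in outside the edit mask:

```python
import numpy as np

def noise_to_t(x0, t, rng):
    """Stand-in forward diffusion with a toy linear schedule: noise a
    clean latent x0 to timestep t in [0, 1)."""
    return np.sqrt(1 - t) * x0 + np.sqrt(t) * rng.standard_normal(x0.shape)

def region_aware_blend(edited_latent, original_x0, edit_mask, t, rng):
    """Keep non-edit content intact: inside the mask use the edited
    latent, outside it use the re-noised original at the same timestep."""
    background = noise_to_t(original_x0, t, rng)
    return edit_mask * edited_latent + (1 - edit_mask) * background

# Toy usage: edit a 16x16 square in the middle of a 4x32x32 latent.
rng = np.random.default_rng(0)
mask = np.zeros((32, 32)); mask[8:24, 8:24] = 1.0
x = region_aware_blend(rng.standard_normal((4, 32, 32)),
                       rng.standard_normal((4, 32, 32)), mask, t=0.5, rng=rng)
```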
arXiv Detail & Related papers (2023-02-23T06:20:29Z)
- Region-Based Semantic Factorization in GANs [67.90498535507106]
We present a highly efficient algorithm to factorize the latent semantics learned by Generative Adversarial Networks (GANs) concerning an arbitrary image region.
Through an appropriately defined generalized Rayleigh quotient, we solve such a problem without any annotations or training.
Experimental results on various state-of-the-art GAN models demonstrate the effectiveness of our approach.
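Maximizing a generalized Rayleigh quotient reduces to a generalized eigenproblem, which is solvable in closed form without annotations or training. In the sketch below, building A and B from Jacobians of the region and non-region pixels is an assumption for illustration; `scipy.linalg.eigh` performs the actual solve:

```python
import numpy as np
from scipy.linalg import eigh

def region_directions(J_fg, J_bg, eps=1e-6, k=3):
    """Directions v maximizing the generalized Rayleigh quotient
        (v^T A v) / (v^T B v),  A = J_fg^T J_fg,  B = J_bg^T J_bg + eps*I,
    i.e. latent moves that perturb the chosen region while leaving the
    rest of the image (captured by J_bg) as untouched as possible."""
    A = J_fg.T @ J_fg
    B = J_bg.T @ J_bg + eps * np.eye(J_fg.shape[1])
    w, V = eigh(A, B)           # generalized eigendecomposition, ascending
    return V[:, -k:][:, ::-1]   # top-k eigenvectors, largest first

# Toy usage with random Jacobians over a 32-d latent.
rng = np.random.default_rng(0)
dirs = region_directions(rng.standard_normal((100, 32)),
                         rng.standard_normal((100, 32)))
print(dirs.shape)  # (32, 3)
```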
arXiv Detail & Related papers (2022-02-19T17:46:02Z)
- Domain Adaptive Semantic Segmentation with Regional Contrastive Consistency Regularization [19.279884432843822]
We propose a novel, fully end-to-end trainable approach called regional contrastive consistency regularization (RCCR) for domain adaptive semantic segmentation.
Our core idea is to pull regional features extracted from the same location of two different images closer together, while pushing apart features extracted from different locations of those images.
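That push-pull objective is essentially a contrastive (InfoNCE-style) loss over region locations. A minimal sketch, assuming N pre-extracted region features per image and treating same-location pairs as positives:

```python
import numpy as np

def regional_contrastive_loss(feat_a, feat_b, tau=0.1):
    """feat_a, feat_b: N x D features of the same N region locations taken
    from two images. Same-location pairs (the diagonal) are pulled
    together; all cross-location pairs act as negatives."""
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    logits = a @ b.T / tau                      # N x N cosine similarities
    m = logits.max(axis=1, keepdims=True)       # for numerical stability
    log_prob = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))          # NLL of the diagonal positives

# Toy usage: 16 region locations with 64-d features.
loss = regional_contrastive_loss(np.random.randn(16, 64), np.random.randn(16, 64))
```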
arXiv Detail & Related papers (2021-10-11T11:45:00Z)
- Translate the Facial Regions You Like Using Region-Wise Normalization [27.288255234645472]
We propose a region-wise normalization framework for region-level face translation.
Both shape and texture of different regions can thus be translated to various target styles.
Our approach has further advantages in precise control of the regions to be translated.
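One plausible reading of region-wise normalization, sketched below under the assumption of AdaIN-style statistics swapping within each region mask (the paper's actual formulation may differ):

```python
import numpy as np

def region_wise_normalize(feat, masks, target_means, target_stds, eps=1e-5):
    """AdaIN-style region-wise normalization: whiten the features inside
    each region mask, then re-scale/shift them with statistics taken from
    the corresponding region of a target-style image. feat is C x H x W;
    each mask is H x W binary."""
    out = feat.copy()
    for mask, mu_t, std_t in zip(masks, target_means, target_stds):
        m = mask.astype(bool)
        region = feat[:, m]                          # C x N region pixels
        mu = region.mean(axis=1, keepdims=True)
        std = region.std(axis=1, keepdims=True) + eps
        out[:, m] = std_t * (region - mu) / std + mu_t
    return out

# Toy usage: restyle the left half of an 8x32x32 feature map.
feat = np.random.randn(8, 32, 32)
mask = np.zeros((32, 32), dtype=bool); mask[:, :16] = True
styled = region_wise_normalize(feat, [mask],
                               [np.ones((8, 1))], [0.5 * np.ones((8, 1))])
```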
arXiv Detail & Related papers (2020-07-29T05:55:49Z)