HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation
- URL: http://arxiv.org/abs/2410.14324v1
- Date: Fri, 18 Oct 2024 09:36:10 GMT
- Title: HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation
- Authors: Bo Cheng, Yuhang Ma, Liebucha Wu, Shanyuan Liu, Ao Ma, Xiaoyu Wu, Dawei Leng, Yuhui Yin
- Abstract summary: We propose a Hierarchical Controllable (HiCo) diffusion model for layout-to-image generation.
Our key insight is to achieve spatial disentanglement through hierarchical modeling of layouts.
To evaluate the performance of multi-objective controllable layout generation in natural scenes, we introduce the HiCo-7K benchmark.
- Score: 11.087309945227826
- License:
- Abstract: The task of layout-to-image generation involves synthesizing images based on the captions of objects and their spatial positions. Existing methods still struggle with complex layout generation, where common failure cases include missing objects, inconsistent lighting, and conflicting view angles. To effectively address these issues, we propose a Hierarchical Controllable (HiCo) diffusion model for layout-to-image generation, featuring an object-separable conditioning branch structure. Our key insight is to achieve spatial disentanglement through hierarchical modeling of layouts. We use a multi-branch structure to represent the hierarchy and aggregate the branches in a fusion module. To evaluate the performance of multi-objective controllable layout generation in natural scenes, we introduce the HiCo-7K benchmark, derived from the GRIT-20M dataset and manually cleaned. Code: https://github.com/360CVGroup/HiCo_T2I.
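The abstract outlines an object-separable conditioning branch structure whose per-object outputs are aggregated by a fusion module. The following is a minimal, hypothetical PyTorch sketch of that idea, not the authors' implementation; the module names (BranchEncoder, FusionModule), the feature dimensions, and the simple sum-based fusion are illustrative assumptions.

```python
# Minimal sketch (NOT the HiCo code) of object-separable conditioning branches
# aggregated by a fusion module. All names and dimensions are assumptions.
import torch
import torch.nn as nn


class BranchEncoder(nn.Module):
    """Encodes one object's caption embedding and box into a masked spatial feature map."""

    def __init__(self, text_dim: int, feat_dim: int, size: int = 32):
        super().__init__()
        self.size = size
        self.proj = nn.Linear(text_dim + 4, feat_dim)  # caption embedding + (x1, y1, x2, y2)

    def forward(self, caption_emb: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
        # caption_emb: (B, text_dim); box: (B, 4) with coordinates normalized to [0, 1]
        cond = self.proj(torch.cat([caption_emb, box], dim=-1))           # (B, feat_dim)
        feat = cond[:, :, None, None].expand(-1, -1, self.size, self.size)
        # Keep each branch's feature inside its own box so branches stay spatially disentangled.
        mask = torch.zeros(box.shape[0], 1, self.size, self.size, device=box.device)
        for b in range(box.shape[0]):
            x1, y1, x2, y2 = (box[b] * self.size).long().clamp(0, self.size - 1).tolist()
            mask[b, :, y1:y2 + 1, x1:x2 + 1] = 1.0
        return feat * mask


class FusionModule(nn.Module):
    """Aggregates the per-object branch features into one conditioning map."""

    def forward(self, branch_feats: list[torch.Tensor]) -> torch.Tensor:
        stacked = torch.stack(branch_feats, dim=0)   # (N_obj, B, C, H, W)
        return stacked.sum(dim=0)                    # simple sum; HiCo's actual fusion may differ


# Usage: two objects, batch size 1, branch weights shared across objects.
encoder, fusion = BranchEncoder(text_dim=768, feat_dim=320), FusionModule()
captions = [torch.randn(1, 768), torch.randn(1, 768)]
boxes = [torch.tensor([[0.10, 0.20, 0.50, 0.60]]), torch.tensor([[0.55, 0.30, 0.90, 0.80]])]
cond_map = fusion([encoder(c, b) for c, b in zip(captions, boxes)])  # (1, 320, 32, 32)
```

In a full pipeline the resulting conditioning map would be injected into the diffusion backbone (for example, added to intermediate UNet features); HiCo's actual injection points and fusion weighting follow the paper.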
Related papers
- GroundingBooth: Grounding Text-to-Image Customization [17.185571339157075]
We introduce GroundingBooth, a framework that achieves zero-shot instance-level spatial grounding on both foreground subjects and background objects.
Our proposed text-image grounding module and masked cross-attention layer allow us to generate personalized images with both accurate layout alignment and identity preservation.
arXiv Detail & Related papers (2024-09-13T03:40:58Z) - PAIR-Diffusion: A Comprehensive Multimodal Object-Level Image Editor [135.17302411419834]
PAIR Diffusion is a generic framework that enables a diffusion model to control the structure and appearance of each object in the image.
We show that having control over the properties of each object in an image leads to comprehensive editing capabilities.
Our framework allows for various object-level editing operations on real images such as reference image-based appearance editing, free-form shape editing, adding objects, and variations.
arXiv Detail & Related papers (2023-03-30T17:13:56Z) - LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation [46.567682868550285]
We propose a diffusion model named LayoutDiffusion that can obtain higher generation quality and greater controllability than the previous works.
In this paper, we propose to construct a structural image patch with region information and transform the patched image into a special layout to fuse with the normal layout in a unified form.
Our experiments show that our LayoutDiffusion outperforms the previous SOTA methods on FID and CAS by a relative 46.35% and 26.70% on COCO-stuff, and 44.29% and 41.82% on VG.
arXiv Detail & Related papers (2023-03-30T06:56:12Z) - COFS: Controllable Furniture layout Synthesis [40.68096097121981]
Many existing methods tackle layout synthesis as a sequence generation problem, which imposes a specific ordering on the elements of the layout.
We propose COFS, an architecture built from standard transformer blocks as used in language modeling.
Our model consistently outperforms other methods which we verify by performing quantitative evaluations.
arXiv Detail & Related papers (2022-05-29T13:31:18Z) - Constrained Graphic Layout Generation via Latent Optimization [17.05026043385661]
We generate graphic layouts that can flexibly incorporate design semantics, either specified implicitly or explicitly by a user.
Our approach builds on a generative layout model based on a Transformer architecture, and formulates the layout generation as a constrained optimization problem.
We show in the experiments that our approach is capable of generating realistic layouts in both constrained and unconstrained generation tasks with a single model.
arXiv Detail & Related papers (2021-08-02T13:04:11Z) - Scene Graph to Image Generation with Contextualized Object Layout Refinement [92.85331019618332]
We propose a novel method to generate images from scene graphs.
Our approach improves the layout coverage by almost 20 points and drops object overlap to negligible amounts.
arXiv Detail & Related papers (2020-09-23T06:27:54Z) - Improving Semantic Segmentation via Decoupled Body and Edge Supervision [89.57847958016981]
Existing semantic segmentation approaches either aim to improve objects' inner consistency by modeling the global context, or refine object details along their boundaries via multi-scale feature fusion.
In this paper, a new paradigm for semantic segmentation is proposed.
Our insight is that appealing performance of semantic segmentation requires explicitly modeling the object body and edge, which correspond to the high and low frequency of the image.
We show that the proposed framework with various baselines or backbone networks leads to better object inner consistency and object boundaries.
arXiv Detail & Related papers (2020-07-20T12:11:22Z) - LayoutTransformer: Layout Generation and Completion with Self-attention [105.21138914859804]
We address the problem of scene layout generation for diverse domains such as images, mobile applications, documents, and 3D objects.
We propose LayoutTransformer, a novel framework that leverages self-attention to learn contextual relationships between layout elements.
Our framework allows us to generate a new layout either from an empty set or from an initial seed set of primitives, and can easily scale to support an arbitrary number of primitives per layout.
arXiv Detail & Related papers (2020-06-25T17:56:34Z) - Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images (see the sketch after this list).
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
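The last entry above introduces SceneFID, an object-centric adaptation of FID. Below is a minimal sketch of that idea, assuming ground-truth boxes are available and using the torchmetrics FID implementation purely for illustration; the paper's exact crop sizes and Inception settings are not reproduced here.

```python
# Illustrative sketch of an object-centric FID: compute FID over per-object crops
# rather than whole images. Function names and crop handling are assumptions; the
# SceneFID paper's exact protocol may differ.
import torch
import torch.nn.functional as F
from torchmetrics.image.fid import FrechetInceptionDistance


def crop_objects(images: torch.Tensor, boxes: list[list[tuple]], size: int = 299) -> torch.Tensor:
    """images: (B, 3, H, W) uint8; boxes[i]: list of (x1, y1, x2, y2) pixel boxes for image i."""
    crops = []
    for img, img_boxes in zip(images, boxes):
        for x1, y1, x2, y2 in img_boxes:
            crop = img[:, y1:y2, x1:x2].float().unsqueeze(0)          # (1, 3, h, w)
            crop = F.interpolate(crop, size=(size, size), mode="bilinear", align_corners=False)
            crops.append(crop.clamp(0, 255).to(torch.uint8))
    return torch.cat(crops, dim=0)                                     # (N_obj, 3, size, size)


def scene_fid(real_imgs, fake_imgs, real_boxes, fake_boxes) -> float:
    """FID between the sets of real and generated object crops."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(crop_objects(real_imgs, real_boxes), real=True)
    fid.update(crop_objects(fake_imgs, fake_boxes), real=False)
    return fid.compute().item()
```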