Layout-Bridging Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2208.06162v1
- Date: Fri, 12 Aug 2022 08:21:42 GMT
- Title: Layout-Bridging Text-to-Image Synthesis
- Authors: Jiadong Liang, Wenjie Pei and Feng Lu
- Abstract summary: We push for effective modeling in both text-to-layout generation and layout-to-image synthesis.
We focus on learning the textual-visual semantic alignment per object in the layout to precisely incorporate the input text into the layout-to-image synthesis process.
- Score: 20.261873143881573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The crux of text-to-image synthesis stems from the difficulty of preserving
the cross-modality semantic consistency between the input text and the
synthesized image. Typical methods, which seek to model the text-to-image
mapping directly, can only capture keywords in the text that indicate common
objects or actions but fail to learn their spatial distribution patterns. An
effective way to circumvent this limitation is to generate an image layout as
guidance, which is attempted by a few methods. Nevertheless, these methods fail
to generate practically effective layouts due to the diversity of input text
and object location. In this paper we push for effective modeling in both
text-to-layout generation and layout-to-image synthesis. Specifically, we
formulate the text-to-layout generation as a sequence-to-sequence modeling
task, and build our model upon Transformer to learn the spatial relationships
between objects by modeling the sequential dependencies between them. In the
stage of layout-to-image synthesis, we focus on learning the textual-visual
semantic alignment per object in the layout to precisely incorporate the input
text into the layout-to-image synthesis process. To evaluate the quality of the
generated layouts, we design a dedicated metric, dubbed the Layout Quality
Score, which considers both the absolute distribution errors of bounding boxes
in the layout and the mutual spatial relationships between them. Extensive
experiments on three datasets demonstrate the superior performance of our
method over state-of-the-art methods on both predicting the layout and
synthesizing the image from the given text.
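The abstract describes the text-to-layout stage as sequence-to-sequence modeling with a Transformer that captures sequential dependencies between objects. The sketch below is only an illustration of such a formulation under stated assumptions (the quantized layout vocabulary, the five-token-per-object encoding, and all module names and hyper-parameters are hypothetical, not the paper's released code): the caption is encoded once, and a Transformer decoder autoregressively emits layout tokens, five per object (class, x, y, w, h).

```python
import torch
import torch.nn as nn


class TextToLayoutSketch(nn.Module):
    """Hypothetical text-to-layout model: caption tokens in, layout tokens out."""

    def __init__(self, text_vocab=10000, layout_vocab=256, d_model=256,
                 nhead=8, num_layers=4, max_len=128):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.layout_emb = nn.Embedding(layout_vocab, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.head = nn.Linear(d_model, layout_vocab)

    def forward(self, text_ids, layout_ids):
        # text_ids:   (B, T_text) caption token ids
        # layout_ids: (B, T_layout) layout tokens laid out as
        #             [class, x, y, w, h, class, x, y, w, h, ...]
        memory = self.encoder(self.text_emb(text_ids))
        pos = torch.arange(layout_ids.size(1), device=layout_ids.device)
        tgt = self.layout_emb(layout_ids) + self.pos_emb(pos)
        # Causal mask: each layout token attends only to earlier ones, which is
        # how sequential dependencies between objects are modeled.
        L = layout_ids.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"),
                                       device=layout_ids.device), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(hidden)  # logits over the next layout token


# Toy usage: a 12-token caption and two objects already emitted (10 tokens).
model = TextToLayoutSketch()
logits = model(torch.randint(0, 10000, (1, 12)), torch.randint(0, 256, (1, 10)))
print(logits.shape)  # torch.Size([1, 10, 256])
```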
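For the layout-to-image stage, the abstract emphasizes textual-visual semantic alignment per object. One hedged way to realize this idea (an assumption for illustration; `PerObjectTextAlignment`, the box pooling choice, and the shapes are not taken from the paper) is to pool a visual feature for each bounding box and let it cross-attend over the caption's word embeddings, so the text is injected object by object rather than globally:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class PerObjectTextAlignment(nn.Module):
    """Hypothetical per-object cross-attention between box features and words."""

    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, feat_map, boxes, word_emb):
        # feat_map: (B, C, H, W) image features from a layout-to-image generator
        # boxes:    list of B tensors, each (N_i, 4), (x1, y1, x2, y2) on feat_map
        # word_emb: (B, T, C) word embeddings of the input text
        obj_feats = roi_align(feat_map, boxes, output_size=1)  # (sum N_i, C, 1, 1)
        obj_feats = obj_feats.flatten(1)                       # (sum N_i, C)

        aligned, start = [], 0
        for b, bx in enumerate(boxes):
            q = obj_feats[start:start + len(bx)].unsqueeze(0)  # (1, N_b, C)
            # Each object queries the words of its own caption.
            out, _ = self.attn(q, word_emb[b:b + 1], word_emb[b:b + 1])
            aligned.append(out.squeeze(0))
            start += len(bx)
        return aligned  # per-image list of (N_i, C) text-aligned object features


# Toy usage: a 2-object layout over a 16x16 feature map and an 8-word caption.
m = PerObjectTextAlignment()
feats = torch.randn(1, 256, 16, 16)
boxes = [torch.tensor([[0., 0., 8., 8.], [4., 4., 15., 15.]])]
words = torch.randn(1, 8, 256)
print(m(feats, boxes, words)[0].shape)  # torch.Size([2, 256])
```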
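The Layout Quality Score is described only at a high level: it accounts for the absolute distribution errors of the bounding boxes and for their mutual spatial relationships. The abstract does not give the formula, so the function below is merely a hedged illustration of a metric with those two ingredients; the L1 placement error, the pairwise center-offset term, and the weight `alpha` are assumptions, not the paper's definition.

```python
import numpy as np


def layout_quality_sketch(pred, gt, alpha=0.5):
    """pred, gt: (N, 4) arrays of normalized (x, y, w, h) boxes, matched by index."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)

    # (i) Absolute distribution error: mean L1 distance between matched boxes.
    abs_err = np.abs(pred - gt).mean()

    # (ii) Mutual spatial relationships: compare the pairwise center offsets
    # (dx, dy) of every object pair in the prediction against the ground truth.
    def pairwise_offsets(b):
        c = b[:, :2] + b[:, 2:] / 2.0          # box centers
        return c[:, None, :] - c[None, :, :]   # (N, N, 2) offset matrix

    rel_err = np.abs(pairwise_offsets(pred) - pairwise_offsets(gt)).mean()

    # Higher is better; both error terms are bounded because boxes are normalized.
    return 1.0 - (alpha * abs_err + (1 - alpha) * rel_err)


# Toy usage: a two-object layout where the second box drifts slightly.
gt = [[0.1, 0.1, 0.3, 0.3], [0.6, 0.5, 0.3, 0.4]]
pred = [[0.1, 0.1, 0.3, 0.3], [0.55, 0.55, 0.3, 0.4]]
print(round(layout_quality_sketch(pred, gt), 3))
```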
Related papers
- GroundingBooth: Grounding Text-to-Image Customization [17.185571339157075]
We introduce GroundingBooth, a framework that achieves zero-shot instance-level spatial grounding on both foreground subjects and background objects.
Our proposed text-image grounding module and masked cross-attention layer allow us to generate personalized images with both accurate layout alignment and identity preservation.
arXiv Detail & Related papers (2024-09-13T03:40:58Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis [60.260724486834164]
This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries.
We present two key innovations: Vision Guidance and the Layered Rendering Diffusion framework.
We apply our method to three practical applications: bounding box-to-image, semantic mask-to-image and image editing.
arXiv Detail & Related papers (2023-11-30T10:36:19Z)
- LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis [24.925757148750684]
We propose a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions.
LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods.
arXiv Detail & Related papers (2023-11-21T04:28:12Z)
- A Parse-Then-Place Approach for Generating Graphic Layouts from Textual Descriptions [50.469491454128246]
We use text as the guidance to create graphic layouts, i.e., Text-to-Layout, aiming to lower the design barriers.
Text-to-Layout is a challenging task because it needs to consider the implicit, combined, and incomplete constraints in the text.
We present a two-stage approach, named parse-then-place, to address this problem.
arXiv Detail & Related papers (2023-08-24T10:37:00Z)
- LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721]
We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
arXiv Detail & Related papers (2023-08-09T17:45:04Z)
- SceneComposer: Any-Level Semantic Image Synthesis [80.55876413285587]
We propose a new framework for conditional image synthesis from semantic layouts of any precision levels.
The framework naturally reduces to text-to-image (T2I) at the lowest level with no shape information, and it becomes segmentation-to-image (S2I) at the highest level.
We introduce several novel techniques to address the challenges coming with this new setup.
arXiv Detail & Related papers (2022-11-21T18:59:05Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning performs text-image matching by mapping the image and the text into a common embedding space.
Instance-level optimization is used for identity preservation during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
- Person-in-Context Synthesis with Compositional Structural Space [59.129960774988284]
We propose a new problem, Persons in Context Synthesis, which aims to synthesize diverse person instance(s) in consistent contexts.
The context is specified by a bounding-box object layout, which lacks shape information, while the pose of the person(s) is given by sparsely annotated keypoints.
To handle the stark difference in input structures, we propose two separate neural branches to attentively composite the respective (context/person) inputs into a shared "compositional structural space".
This structural space is then decoded to the image space using a multi-level feature modulation strategy and learned in a self-supervised manner.
arXiv Detail & Related papers (2020-08-28T14:33:28Z)