SceneComposer: Any-Level Semantic Image Synthesis
- URL: http://arxiv.org/abs/2211.11742v1
- Date: Mon, 21 Nov 2022 18:59:05 GMT
- Title: SceneComposer: Any-Level Semantic Image Synthesis
- Authors: Yu Zeng, Zhe Lin, Jianming Zhang, Qing Liu, John Collomosse, Jason
Kuen, Vishal M. Patel
- Abstract summary: We propose a new framework for conditional image synthesis from semantic layouts of any precision levels.
The framework naturally reduces to text-to-image (T2I) at the lowest level with no shape information, and it becomes segmentation-to-image (S2I) at the highest level.
We introduce several novel techniques to address the challenges that come with this new setup.
- Score: 80.55876413285587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a new framework for conditional image synthesis from semantic
layouts of any precision levels, ranging from pure text to a 2D semantic canvas
with precise shapes. More specifically, the input layout consists of one or
more semantic regions with free-form text descriptions and adjustable precision
levels, which can be set based on the desired controllability. The framework
naturally reduces to text-to-image (T2I) at the lowest level with no shape
information, and it becomes segmentation-to-image (S2I) at the highest level.
By supporting the levels in-between, our framework is flexible in assisting
users of different drawing expertise and at different stages of their creative
workflow. We introduce several novel techniques to address the challenges
that come with this new setup, including a pipeline for collecting training data;
a precision-encoded mask pyramid and a text feature map representation to
jointly encode precision level, semantics, and composition information; and a
multi-scale guided diffusion model to synthesize images. To evaluate the
proposed method, we collect a test dataset containing user-drawn layouts with
diverse scenes and styles. Experimental results show that the proposed method
can generate high-quality images following the layout at given precision, and
compares favorably against existing methods. Project page:
https://zengxianyu.github.io/scenec/
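The abstract states the key representation only at a high level. Below is a minimal sketch of how a precision-encoded mask pyramid and a text feature map might be assembled from a list of regions, assuming each region carries a rough binary mask, a text embedding, and a precision level in [0, 1]; all names, the precision gating rule, and the toy dimensions are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code): encode an any-level layout as a
# precision-encoded mask pyramid plus a text feature map. Each region is
# assumed to carry a rough binary mask, a text embedding, and a precision
# level in [0, 1] (0 = text only / no shape, 1 = exact segmentation).
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class LayoutRegion:                 # hypothetical container, for illustration
    mask: np.ndarray                # (H, W) binary mask of the region's rough shape
    text_emb: np.ndarray            # (D,) embedding of the region's free-form text
    precision: float                # 0.0 = no shape control, 1.0 = precise shape

def build_mask_pyramid(regions: List[LayoutRegion], levels: int = 4) -> List[np.ndarray]:
    """Stack region masks at several resolutions; a region contributes to finer
    levels only if its precision is high enough, so coarse regions constrain
    the image loosely and precise regions constrain it tightly."""
    h, w = regions[0].mask.shape
    pyramid = []
    for lvl in range(levels):
        scale = 2 ** (levels - 1 - lvl)                 # coarse -> fine
        lh, lw = h // scale, w // scale
        canvas = np.zeros((len(regions), lh, lw), dtype=np.float32)
        for i, r in enumerate(regions):
            # Keep the mask at this level only if the region's precision
            # justifies shape control at this resolution (assumed rule).
            if r.precision >= lvl / max(levels - 1, 1):
                canvas[i] = r.mask[::scale, ::scale][:lh, :lw]   # nearest-neighbour downsample
        pyramid.append(canvas)
    return pyramid

def build_text_feature_map(regions: List[LayoutRegion]) -> np.ndarray:
    """Scatter each region's text embedding over its mask to obtain an
    (H, W, D) map carrying both semantics and composition."""
    h, w = regions[0].mask.shape
    d = regions[0].text_emb.shape[0]
    fmap = np.zeros((h, w, d), dtype=np.float32)
    for r in regions:
        fmap[r.mask.astype(bool)] = r.text_emb
    return fmap

# Toy usage: one coarse "sky" region and one precisely drawn "dog" region.
rng = np.random.default_rng(0)
sky = LayoutRegion(mask=np.ones((64, 64)), text_emb=rng.normal(size=8), precision=0.2)
dog = LayoutRegion(mask=np.pad(np.ones((16, 16)), 24), text_emb=rng.normal(size=8), precision=1.0)
print([p.shape for p in build_mask_pyramid([sky, dog])], build_text_feature_map([sky, dog]).shape)
```

In the paper these representations condition a multi-scale guided diffusion model that synthesizes the image; that stage is omitted from the sketch.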
Related papers
- Self-supervised Scene Text Segmentation with Object-centric Layered
Representations Augmented by Text Regions [22.090074821554754]
We propose a self-supervised scene text segmentation algorithm that uses layered decoupling of object-centric representations to segment images into text and background.
On several public scene text datasets, our method outperforms state-of-the-art unsupervised segmentation algorithms.
arXiv Detail & Related papers (2023-08-25T05:00:05Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Layout-Bridging Text-to-Image Synthesis [20.261873143881573]
We push for effective modeling in both text-to-image generation and layout-to-image synthesis.
We focus on learning the textual-visual semantic alignment per object in the layout to precisely incorporate the input text into the layout-to-image synthesis process.
arXiv Detail & Related papers (2022-08-12T08:21:42Z)
- DT2I: Dense Text-to-Image Generation from Region Descriptions [3.883984493622102]
We introduce dense text-to-image (DT2I) synthesis as a new task to pave the way toward more intuitive image generation.
We also propose DTC-GAN, a novel method to generate images from semantically rich region descriptions.
arXiv Detail & Related papers (2022-04-05T07:57:11Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- Semantic Layout Manipulation with High-Resolution Sparse Attention [106.59650698907953]
We tackle the problem of semantic image layout manipulation, which aims to manipulate an input image by editing its semantic label map.
A core problem of this task is how to transfer visual details from the input images to the new semantic layout while making the resulting image visually realistic.
We propose a high-resolution sparse attention module that effectively transfers visual details to new layouts at a resolution up to 512x512.
arXiv Detail & Related papers (2020-12-14T06:50:43Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
A visual-linguistic similarity module learns text-image matching by mapping images and text into a common embedding space.
Instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
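The visual-linguistic similarity component is described above only as mapping images and text into a common embedding space. The sketch below illustrates that idea with cosine similarity between linearly projected features; the projection matrices, dimensions, and scoring function are illustrative stand-ins, not TediGAN's actual modules.

```python
# Minimal sketch (not TediGAN's code) of text-image matching in a common
# embedding space: project image and text features into the same space and
# score matches by cosine similarity. All dimensions and the linear
# projections are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
IMG_DIM, TXT_DIM, JOINT_DIM = 512, 256, 128

# Stand-ins for learned projections into the joint embedding space.
W_img = rng.normal(scale=0.02, size=(IMG_DIM, JOINT_DIM))
W_txt = rng.normal(scale=0.02, size=(TXT_DIM, JOINT_DIM))

def embed(features: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project features into the joint space and L2-normalize them."""
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def match_score(img_feat: np.ndarray, txt_feat: np.ndarray) -> float:
    """Cosine similarity in the common embedding space; higher means a better match."""
    return float(embed(img_feat, W_img) @ embed(txt_feat, W_txt))

# Toy usage with random vectors standing in for encoder outputs.
print(round(match_score(rng.normal(size=IMG_DIM), rng.normal(size=TXT_DIM)), 4))
```

The StyleGAN inversion and instance-level optimization steps mentioned above are separate components and are not sketched here.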
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.