Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis
- URL: http://arxiv.org/abs/2208.13753v1
- Date: Mon, 29 Aug 2022 17:37:29 GMT
- Title: Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis
- Authors: Wan-Cyuan Fan, Yen-Chun Chen, DongDong Chen, Yu Cheng, Lu Yuan,
Yu-Chiang Frank Wang
- Abstract summary: We present Frido, a Feature Pyramid Diffusion model performing a multi-scale coarse-to-fine denoising process for image synthesis.
Our model decomposes an input image into scale-dependent vector quantized features, followed by a coarse-to-fine gating for producing image output.
We conduct extensive experiments over various unconditional and conditional image generation tasks, including text-to-image, layout-to-image, scene-graph-to-image, and label-to-image synthesis.
- Score: 77.23998762763078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models (DMs) have shown great potential for high-quality image
synthesis. However, when it comes to producing images with complex scenes, how
to properly describe both image global structures and object details remains a
challenging task. In this paper, we present Frido, a Feature Pyramid Diffusion
model performing a multi-scale coarse-to-fine denoising process for image
synthesis. Our model decomposes an input image into scale-dependent vector
quantized features, followed by a coarse-to-fine gating for producing image
output. During the above multi-scale representation learning stage, additional
input conditions like text, scene graph, or image layout can be further
exploited. Thus, Frido can also be applied to conditional or cross-modality
image synthesis. We conduct extensive experiments over various unconditional
and conditional image generation tasks, including text-to-image, layout-to-image,
scene-graph-to-image, and label-to-image synthesis. More specifically, we
achieve state-of-the-art FID scores on five benchmarks, namely layout-to-image
on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and
label-to-image on COCO. Code is available at
https://github.com/davidhalladay/Frido.
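To make the coarse-to-fine pyramid idea concrete, below is a minimal, self-contained PyTorch sketch of sampling a multi-scale latent pyramid from coarse to fine. It is only an illustration of the general pattern, not Frido's implementation (that lives in the linked repository): the TinyDenoiser module, the additive conditioning on the coarser scale, the number of scales, and the crude update rule are all simplified assumptions made for this sketch.

```python
# Illustrative sketch only (NOT Frido's actual code; see the linked repo).
# All module names, shapes, and the update rule below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyDenoiser(nn.Module):
    """Toy stand-in for the per-scale denoising network of a diffusion model."""

    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        # A real model would embed the timestep t (and any text/layout
        # condition); it is ignored here to keep the sketch short.
        return self.net(x)


@torch.no_grad()
def coarse_to_fine_sample(denoisers, channels=4, sizes=(8, 16, 32), steps=50):
    """Denoise the coarsest scale first, then progressively finer scales,
    passing each finished scale to the next one as simple conditioning."""
    prev = None
    latents = []
    for denoiser, size in zip(denoisers, sizes):
        x = torch.randn(1, channels, size, size)  # start from pure noise
        if prev is not None:
            # Inject the coarser result (upsampled) as additive conditioning.
            x = x + F.interpolate(prev, size=(size, size), mode="nearest")
        for t in reversed(range(steps)):
            eps = denoiser(x, t)   # predicted noise at this step
            x = x - eps / steps    # crude update rule, purely illustrative
        latents.append(x)
        prev = x
    return latents  # coarse -> fine latent pyramid


# Usage: one toy denoiser per pyramid level.
models = [TinyDenoiser(4) for _ in range(3)]
pyramid = coarse_to_fine_sample(models)
print([z.shape for z in pyramid])
```

In the paper, each per-scale latent is a scale-dependent vector-quantized feature produced by the multi-scale autoencoder, and the finished pyramid is turned into an image by a coarse-to-fine gating mechanism; the sketch above only mimics the coarse-to-fine sampling order.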
Related papers
- SAMPLING: Scene-adaptive Hierarchical Multiplane Images Representation
for Novel View Synthesis from a Single Image [60.52991173059486]
We introduce SAMPLING, a Scene-adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis from a Single Image.
Our method demonstrates considerable performance gains in large-scale unbounded outdoor scenes using a single image on the KITTI dataset.
arXiv Detail & Related papers (2023-09-12T15:33:09Z)
- Composer: Creative and Controllable Image Synthesis with Composable Conditions [57.78533372393828]
Recent large-scale generative models learned on big data are capable of synthesizing incredible images yet suffer from limited controllability.
This work offers a new generation paradigm that allows flexible control of the output image, such as spatial layout and palette, while maintaining the synthesis quality and model creativity.
arXiv Detail & Related papers (2023-02-20T05:48:41Z)
- A Shared Representation for Photorealistic Driving Simulators [83.5985178314263]
We propose to improve the quality of generated images by rethinking the discriminator architecture.
The focus is on the class of problems where images are generated given semantic inputs, such as scene segmentation maps or human body poses.
We aim to learn a shared latent representation that encodes enough information to jointly perform semantic segmentation, content reconstruction, and coarse-to-fine grained adversarial reasoning.
arXiv Detail & Related papers (2021-12-09T18:59:21Z)
- Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers [51.581926074686535]
We present a new perspective on image synthesis by viewing the task as a visual token generation problem.
The proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks.
arXiv Detail & Related papers (2021-11-05T12:57:50Z)
- ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis [15.006676130258372]
Autoregressive models incorporate context in a linear 1D order by attending only to previously synthesized image patches above or to the left.
We propose a coarse-to-fine hierarchy of context by combining the autoregressive formulation with a multinomial diffusion process.
Our approach can take unrestricted, user-provided masks into account to perform local image editing.
arXiv Detail & Related papers (2021-08-19T17:50:07Z)
- High-Resolution Complex Scene Synthesis with Transformers [6.445605125467574]
Coarse-grained synthesis of complex scene images via deep generative models has recently gained popularity.
We present an approach to this task, where the generative model is based on pure likelihood training without additional objectives.
We show that the resulting system is able to synthesize high-quality images consistent with the given layouts.
arXiv Detail & Related papers (2021-05-13T17:56:07Z)
- Multimodal Image Synthesis with Conditional Implicit Maximum Likelihood Estimation [54.17177006826262]
We develop a new generic conditional image synthesis method based on Implicit Maximum Likelihood Estimation (IMLE).
We demonstrate improved multimodal image synthesis performance on two tasks, single image super-resolution and image synthesis from scene layouts.
arXiv Detail & Related papers (2020-04-07T03:06:55Z)