ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with
Diffusion Models
- URL: http://arxiv.org/abs/2310.07702v1
- Date: Wed, 11 Oct 2023 17:52:39 GMT
- Title: ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with
Diffusion Models
- Authors: Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia,
Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, Ying Shan
- Abstract summary: We investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes.
Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot adequately address the resulting object repetition and structural issues.
We propose a simple yet effective re-dilation technique that can dynamically adjust the convolutional receptive field during inference.
- Score: 126.35334860896373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we investigate the capability of generating images from
pre-trained diffusion models at much higher resolutions than the training image
sizes. In addition, the generated images should have arbitrary image aspect
ratios. When generating images directly at a higher resolution, 1024 x 1024,
with the pre-trained Stable Diffusion using training images of resolution 512 x
512, we observe persistent problems of object repetition and unreasonable
object structures. Existing works for higher-resolution generation, such as
attention-based and joint-diffusion approaches, cannot adequately address these
issues. Taking a new perspective, we examine the structural components of the U-Net
in diffusion models and identify the crucial cause as the limited receptive
field of convolutional kernels. Based on this key observation, we propose a
simple yet effective re-dilation technique that can dynamically adjust the convolutional
receptive field during inference. We further propose dispersed convolution
and noise-damped classifier-free guidance, which can enable
ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our
approach does not require any training or optimization. Extensive experiments
demonstrate that our approach can address the repetition issue well and achieve
state-of-the-art performance on higher-resolution image synthesis, especially
in texture details. Our work also suggests that a pre-trained diffusion model
trained on low-resolution images can be directly used for high-resolution
visual generation without further tuning, which may provide insights for future
research on ultra-high-resolution image and video synthesis.
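To make the re-dilation idea concrete, the following is a minimal sketch (not the authors' released code), assuming a PyTorch, diffusers-style Stable Diffusion U-Net whose residual blocks use 3x3, stride-1 convolutions; the dilation factor and the choice of layers to patch are illustrative, whereas the paper selects them dynamically across layers and denoising steps.

    # Minimal re-dilation sketch. Assumptions: a PyTorch U-Net (e.g. a diffusers
    # UNet2DConditionModel) with standard 3x3, stride-1 nn.Conv2d layers; the
    # factor and layer selection are illustrative, not the paper's exact schedule.
    import torch.nn as nn

    def redilate(conv: nn.Conv2d, factor: int) -> None:
        """Enlarge a pre-trained conv's receptive field at inference time by
        raising its dilation, with padding adjusted to keep the output size."""
        k = conv.kernel_size[0]
        conv.dilation = (factor, factor)
        conv.padding = (factor * (k // 2), factor * (k // 2))

    def redilate_unet(unet: nn.Module, factor: int = 2) -> None:
        # Patch only 3x3, stride-1 convolutions; 1x1 projections are left untouched.
        for module in unet.modules():
            if (isinstance(module, nn.Conv2d)
                    and module.kernel_size == (3, 3)
                    and module.stride == (1, 1)):
                redilate(module, factor)

    # Hypothetical usage: generate at 1024 x 1024 with a model trained at 512 x 512.
    # redilate_unet(pipe.unet, factor=2)

With a factor of 2, each pre-trained kernel spans twice as many latent pixels, roughly restoring the relative receptive field it had at the training resolution; no weights are modified, so the model can still be used at the original resolution by resetting the dilation to 1.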
Related papers
- DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance [11.44012694656102]
Large-scale generative models, such as text-to-image diffusion models, have garnered widespread attention across diverse domains.
Existing large-scale diffusion models are confined to generating images of up to 1K resolution.
We propose a novel progressive approach that fully utilizes generated low-resolution images to guide the generation of higher-resolution images.
arXiv Detail & Related papers (2024-06-26T16:10:31Z)
- Image Neural Field Diffusion Models [46.781775067944395]
We propose to learn the distribution of continuous images by training diffusion models on image neural fields.
We show that image neural field diffusion models can be trained using mixed-resolution image datasets, outperform fixed-resolution diffusion models, and can solve inverse problems with conditions applied at different scales efficiently.
arXiv Detail & Related papers (2024-06-11T17:24:02Z)
- FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis [48.9652334528436]
We introduce FouriScale, an innovative, training-free approach built on frequency-domain analysis.
We replace the original convolutional layers in pre-trained diffusion models with dilated convolutions combined with a low-pass operation (see the sketch after this list).
Our method balances the structural integrity and fidelity of generated images, achieving arbitrary-size, high-resolution, and high-quality generation.
arXiv Detail & Related papers (2024-03-19T17:59:33Z)
- Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation [112.08287900261898]
This paper proposes a novel self-cascade diffusion model for rapid adaptation to higher-resolution image and video generation.
Our approach achieves a 5X training speed-up and requires only an additional 0.002M tuning parameters.
Experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
arXiv Detail & Related papers (2024-02-16T07:48:35Z)
- HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models [13.68666823175341]
HiDiffusion is a tuning-free higher-resolution framework for image synthesis.
RAU-Net dynamically adjusts the feature map size to resolve object duplication.
MSW-MSA uses optimized window attention to reduce computation.
arXiv Detail & Related papers (2023-11-29T11:01:38Z)
- ACDMSR: Accelerated Conditional Diffusion Models for Single Image Super-Resolution [84.73658185158222]
We propose a diffusion model-based super-resolution method called ACDMSR.
Our method adapts the standard diffusion model to perform super-resolution through a deterministic iterative denoising process.
Our approach generates more visually realistic counterparts for low-resolution images, emphasizing its effectiveness in practical scenarios.
arXiv Detail & Related papers (2023-07-03T06:49:04Z)
- Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z)
- Cascaded Diffusion Models for High Fidelity Image Generation [53.57766722279425]
We show that cascaded diffusion models are capable of generating high fidelity images on the class-conditional ImageNet generation challenge.
A cascaded diffusion model comprises a pipeline of multiple diffusion models that generate images of increasing resolution.
We find that the sample quality of a cascading pipeline relies crucially on conditioning augmentation.
arXiv Detail & Related papers (2021-05-30T17:14:52Z)
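As a rough illustration of the dilation-plus-low-pass substitution mentioned in the FouriScale entry above, the sketch below wraps a pre-trained convolution so that its input features are low-pass filtered in the Fourier domain before a dilated convolution reuses the original weights; the FFT-based filter, the cutoff ratio, and the wrapper itself are assumptions for illustration rather than FouriScale's exact formulation.

    # Rough sketch: low-pass the features, then apply the pre-trained weights with a
    # larger dilation. The FFT low-pass, cutoff, and wrapper are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LowPassDilatedConv(nn.Module):
        def __init__(self, conv: nn.Conv2d, dilation: int = 2, cutoff: float = 0.5):
            super().__init__()
            self.conv, self.dilation, self.cutoff = conv, dilation, cutoff

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Keep only low spatial frequencies of the input feature map.
            freq = torch.fft.fftshift(torch.fft.fft2(x.float()), dim=(-2, -1))
            h, w = x.shape[-2:]
            yy = torch.linspace(-1, 1, h, device=x.device).view(-1, 1)
            xx = torch.linspace(-1, 1, w, device=x.device).view(1, -1)
            keep = ((yy.abs() <= self.cutoff) & (xx.abs() <= self.cutoff)).to(freq.dtype)
            filtered = torch.fft.ifft2(torch.fft.ifftshift(freq * keep, dim=(-2, -1))).real
            # Reuse the pre-trained weights with a larger dilation and matching padding.
            k = self.conv.kernel_size[0]
            pad = self.dilation * (k // 2)
            return F.conv2d(filtered.to(x.dtype), self.conv.weight, self.conv.bias,
                            stride=self.conv.stride, padding=pad, dilation=self.dilation)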