FreCaS: Efficient Higher-Resolution Image Generation via Frequency-aware Cascaded Sampling
- URL: http://arxiv.org/abs/2410.18410v1
- Date: Thu, 24 Oct 2024 03:56:44 GMT
- Title: FreCaS: Efficient Higher-Resolution Image Generation via Frequency-aware Cascaded Sampling
- Authors: Zhengqiang Zhang, Ruihuang Li, Lei Zhang
- Abstract summary: FreCaS decomposes the sampling process into cascaded stages with gradually increased resolutions and expanding frequency bands.
FreCaS significantly outperforms state-of-the-art methods in image quality and generation speed.
- Score: 13.275724439963188
- Abstract: While image generation with diffusion models has achieved great success, generating images at resolutions higher than the training size remains challenging due to the high computational cost. Current methods typically perform the entire sampling process at full resolution and process all frequency components simultaneously, contradicting the inherent coarse-to-fine nature of latent diffusion models and wasting computation on premature high-frequency details at early diffusion stages. To address this issue, we introduce an efficient $\textbf{Fre}$quency-aware $\textbf{Ca}$scaded $\textbf{S}$ampling framework, $\textbf{FreCaS}$ in short, for higher-resolution image generation. FreCaS decomposes the sampling process into cascaded stages with gradually increased resolutions, progressively expanding frequency bands and refining the corresponding details. We propose an innovative frequency-aware classifier-free guidance (FA-CFG) strategy that assigns different guidance strengths to different frequency components, directing the diffusion model to add new details in the frequency band expanded at each stage. Additionally, we fuse the cross-attention maps of the previous and current stages to avoid synthesizing unfaithful layouts. Experiments demonstrate that FreCaS significantly outperforms state-of-the-art methods in image quality and generation speed. In particular, FreCaS is about 2.86$\times$ and 6.07$\times$ faster than ScaleCrafter and DemoFusion, respectively, in generating a 2048$\times$2048 image with a pre-trained SDXL model, and achieves FID$_b$ improvements of 11.6 and 3.7. FreCaS can be easily extended to more complex models such as SD3. The source code of FreCaS is available at https://github.com/xtudbxk/FreCaS.
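The abstract describes FA-CFG only at a high level. As a minimal, hypothetical sketch of the idea (not the authors' implementation), the snippet below splits the classifier-free-guidance direction into low- and high-frequency bands with an FFT and applies a stronger guidance scale to the high-frequency band; the function name, radial band split, and scale values are all illustrative assumptions.

```python
# Hypothetical sketch of frequency-aware CFG: apply a stronger guidance
# scale to the high-frequency band of the CFG direction. The band split
# and scales are assumed for illustration, not taken from the paper.
import torch
import torch.fft as fft

def fa_cfg(eps_cond, eps_uncond, w_low=4.5, w_high=7.5, cutoff=0.25):
    """eps_cond, eps_uncond: (B, C, H, W) noise predictions.
    cutoff: normalized frequency radius separating the two bands."""
    _, _, H, W = eps_cond.shape
    # CFG direction, moved to the frequency domain and centered.
    diff = fft.fftshift(fft.fft2(eps_cond - eps_uncond), dim=(-2, -1))

    # Radial low-pass mask: 1 inside the cutoff radius, 0 outside.
    fy = torch.linspace(-1, 1, H, device=diff.device).view(H, 1)
    fx = torch.linspace(-1, 1, W, device=diff.device).view(1, W)
    low_mask = ((fy**2 + fx**2).sqrt() <= cutoff).to(eps_cond.dtype)

    # Per-band guidance strengths, then back to the spatial domain.
    guided = diff * (w_low * low_mask + w_high * (1.0 - low_mask))
    update = fft.ifft2(fft.ifftshift(guided, dim=(-2, -1))).real
    return eps_uncond + update
```

In FreCaS's cascaded setting, the cutoff would presumably track the frequency band newly introduced at each stage, so that the stronger guidance acts only on the details being added there.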
Related papers
- MSF: Efficient Diffusion Model Via Multi-Scale Latent Factorize [27.749096921628457]
We propose a multiscale diffusion framework that generates hierarchical visual representations, which are subsequently integrated to form the final output.
Our method achieves an FID of 2.2 and an IS of 255.4 on the ImageNet 256x256 benchmark, reducing computational costs by 50% compared to baseline methods.
arXiv Detail & Related papers (2025-01-23T03:18:23Z)
- Diffusion Models Need Visual Priors for Image Generation [86.92260591389818]
Diffusion on Diffusion (DoD) is an innovative multi-stage generation framework that first extracts visual priors from previously generated samples, then provides rich guidance for the diffusion model.
We evaluate DoD on the popular ImageNet-$256\times 256$ dataset, reducing training cost by 7$\times$ compared to SiT and DiT.
Our largest model DoD-XL achieves an FID-50K score of 1.83 with only 1 million training steps, which surpasses other state-of-the-art methods without bells and whistles during inference.
arXiv Detail & Related papers (2024-10-11T05:03:56Z)
- HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration [31.982294870690925]
We propose a novel learning-based caching framework dubbed HarmoniCa.
It incorporates Step-Wise Denoising Training (SDT) to ensure the continuity of the denoising process.
It also incorporates an Image Error Proxy-Guided Objective (IEPO) to balance image quality against cache utilization.
arXiv Detail & Related papers (2024-10-02T16:34:29Z)
- Frequency-Domain Refinement with Multiscale Diffusion for Super Resolution [7.29314801047906]
We propose a novel Frequency Domain-guided multiscale Diffusion model (FDDiff).
FDDiff decomposes the process of complementing high-frequency information into finer-grained steps.
We show that FDDiff outperforms prior generative methods with higher-fidelity super-resolution results.
arXiv Detail & Related papers (2024-05-16T11:58:52Z)
- Cache Me if You Can: Accelerating Diffusion Models through Block Caching [67.54820800003375]
A large image-to-image network has to be applied many times to iteratively refine an image from random noise.
We investigate the behavior of the layers within the network and find that 1) the layers' output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small.
We propose a technique to automatically determine caching schedules based on each block's changes over timesteps.
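As a hypothetical illustration of that idea (not the paper's code), one could record each block's outputs over timesteps on a few calibration prompts and recompute a block only once its accumulated relative change exceeds a threshold; the change metric and threshold below are assumptions.

```python
# Hypothetical sketch: derive a per-block caching schedule from how much
# each block's output drifts between adjacent denoising steps.
import torch

def caching_schedule(block_outputs, threshold=0.1):
    """block_outputs: dict of block_name -> list of tensors, one per
    timestep, recorded on calibration runs. Returns block_name -> set of
    timesteps at which the block must be recomputed."""
    schedule = {}
    for name, outs in block_outputs.items():
        recompute = {0}  # always compute the first step
        acc = 0.0
        for t in range(1, len(outs)):
            # Accumulated relative L1 change since the previous step.
            delta = (outs[t] - outs[t - 1]).abs().mean()
            acc += (delta / (outs[t - 1].abs().mean() + 1e-8)).item()
            if acc >= threshold:  # drifted too far: recompute and reset
                recompute.add(t)
                acc = 0.0
        schedule[name] = recompute
    return schedule
```

At inference time, a block would then be re-evaluated only at the timesteps in its schedule and would reuse its most recently cached output everywhere else.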
arXiv Detail & Related papers (2023-12-06T00:51:38Z)
- ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models [59.90959789767886]
We show that optimizing consistency training loss minimizes the Wasserstein distance between target and generated distributions.
By incorporating a discriminator into the consistency training framework, our method achieves improved FID scores on the CIFAR10, ImageNet 64$\times$64, and LSUN Cat 256$\times$256 datasets.
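As a rough, hypothetical sketch of what adding a discriminator to consistency training can look like (the parametrization, distance metric, and adversarial weight below are assumptions, not the paper's settings):

```python
# Hypothetical sketch: consistency-training loss plus a non-saturating
# adversarial term from a discriminator D on the one-step sample.
import torch
import torch.nn.functional as F

def act_loss(f, f_ema, D, x, t_n, t_np1):
    """x: (B, C, H, W) clean images; t_n < t_np1: (B,) noise levels.
    f: consistency model; f_ema: EMA copy used as stop-gradient target."""
    z = torch.randn_like(x)
    # Two points on the same noising trajectory (same z, different times).
    x_hi = x + t_np1.view(-1, 1, 1, 1) * z
    x_lo = x + t_n.view(-1, 1, 1, 1) * z
    pred = f(x_hi, t_np1)
    with torch.no_grad():
        target = f_ema(x_lo, t_n)
    consistency = F.mse_loss(pred, target)
    # Push the discriminator to rate the one-step sample as real.
    adversarial = F.softplus(-D(pred)).mean()
    return consistency + 0.1 * adversarial
```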
arXiv Detail & Related papers (2023-11-23T16:49:06Z)
- Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images [56.17404812357676]
Stable Diffusion, a generative model used in text-to-image synthesis, frequently encounters composition problems when generating images of varying sizes.
We propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size.
We show that ASD can produce well-structured images of arbitrary sizes, cutting down the inference time by 2x compared to the traditional tiled algorithm.
arXiv Detail & Related papers (2023-08-31T09:27:56Z)
- SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds [88.06788636008051]
Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers.
These models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run.
We present a generic approach that unlocks running text-to-image diffusion models on mobile devices in less than $2$ seconds.
arXiv Detail & Related papers (2023-06-01T17:59:25Z)
- On the Importance of Noise Scheduling for Diffusion Models [8.360383061862844]
We study the effect of noise scheduling strategies for denoising diffusion generative models.
A simple compound recipe, scaling the input data while keeping the noise schedule function fixed, yields state-of-the-art pixel-based diffusion models for high-resolution images on ImageNet.
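The summary above compresses the recipe heavily; the sketch below illustrates the input-scaling idea under assumed names and shapes. Scaling the input by $b < 1$ effectively injects relatively more noise at every step, which helps at higher resolutions where neighboring pixels are redundant.

```python
# Sketch of the input-scaling recipe (names and shapes assumed).
import torch

def noisy_sample(x0, gamma_t, b=0.5):
    """x0: (B, C, H, W) clean images; gamma_t: (B,) signal levels in
    [0, 1] from e.g. a cosine schedule; b < 1 strengthens the noise."""
    eps = torch.randn_like(x0)
    g = gamma_t.view(-1, 1, 1, 1)
    x_t = g.sqrt() * (b * x0) + (1.0 - g).sqrt() * eps
    # Renormalize to roughly unit variance so the network sees a
    # comparable input scale for any choice of b.
    return x_t / (g * b**2 + (1.0 - g)).sqrt()
```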
arXiv Detail & Related papers (2023-01-26T07:37:22Z)
- Wavelet Diffusion Models are fast and scalable Image Generators [3.222802562733787]
Diffusion models are a powerful solution for high-fidelity image generation, exceeding GANs in quality in many circumstances.
The recent DiffusionGAN method significantly decreases running time by reducing the number of sampling steps from thousands to a few, but its speed still lags far behind that of GANs.
This paper aims to reduce the speed gap by proposing a novel wavelet-based diffusion scheme.
We extract low- and high-frequency components at both the image and feature levels via wavelet decomposition and adaptively handle these components for faster processing while maintaining good generation quality.
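A single-level Haar transform makes this concrete. The sketch below is illustrative only (the paper's wavelet choice and handling may differ): it splits an image into one low-frequency subband and three high-frequency subbands at half resolution, so a diffusion model can process the stacked subbands at a quarter of the spatial cost and be losslessly inverted afterwards.

```python
# Illustrative single-level Haar wavelet transform and its inverse.
import torch

def haar_dwt(x):
    """x: (B, C, H, W), H and W even. Returns (B, 4C, H/2, W/2) with the
    subbands stacked as [LL, LH, HL, HH] along the channel axis."""
    a = x[:, :, 0::2, 0::2]  # even rows, even cols
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2  # low-frequency average
    lh = (a + b - c - d) / 2  # vertical variation (horizontal edges)
    hl = (a - b + c - d) / 2  # horizontal variation (vertical edges)
    hh = (a - b - c + d) / 2  # diagonal detail
    return torch.cat([ll, lh, hl, hh], dim=1)

def haar_idwt(y):
    """Exact inverse of haar_dwt: (B, 4C, H/2, W/2) -> (B, C, H, W)."""
    ll, lh, hl, hh = torch.chunk(y, 4, dim=1)
    B, C, H2, W2 = ll.shape
    x = ll.new_zeros(B, C, H2 * 2, W2 * 2)
    x[:, :, 0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[:, :, 0::2, 1::2] = (ll + lh - hl - hh) / 2
    x[:, :, 1::2, 0::2] = (ll - lh + hl - hh) / 2
    x[:, :, 1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x
```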
arXiv Detail & Related papers (2022-11-29T12:25:25Z)
- Unsupervised Single Image Super-resolution Under Complex Noise [60.566471567837574]
This paper proposes a model-based unsupervised SISR method to deal with the general SISR task with unknown degradations.
The proposed method clearly surpasses the current state-of-the-art (SotA) method by about 1 dB in PSNR, not only with a lighter model (0.34M vs. 2.40M parameters) but also at a faster speed.
arXiv Detail & Related papers (2021-07-02T11:55:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.