MSF: Efficient Diffusion Model Via Multi-Scale Latent Factorize
- URL: http://arxiv.org/abs/2501.13349v1
- Date: Thu, 23 Jan 2025 03:18:23 GMT
- Title: MSF: Efficient Diffusion Model Via Multi-Scale Latent Factorize
- Authors: Haohang Xu, Longyu Chen, Shuangrui Ding, Yilin Gao, Dongsheng Jiang, Yin Li, Shugong Xu, Junqing Yu, Wei Yang
- Abstract summary: We propose a multiscale diffusion framework that generates hierarchical visual representations, which are subsequently integrated to form the final output.
Our method achieves an FID of 2.2 and an IS of 255.4 on the ImageNet 256x256 benchmark, reducing computational costs by 50% compared to baseline methods.
- Abstract: Diffusion-based generative models have achieved remarkable progress in visual content generation. However, traditional diffusion models directly denoise the entire image from noisy inputs, disregarding the hierarchical structure present in visual signals. This method is computationally intensive, especially for high-resolution image generation. Signal processing often leverages hierarchical decompositions; for instance, Fourier analysis decomposes signals by frequency, while wavelet analysis captures localized frequency components, reflecting both spatial and frequency information simultaneously. Inspired by these principles, we propose a multiscale diffusion framework that generates hierarchical visual representations, which are subsequently integrated to form the final output. The diffusion model target, whether raw RGB pixels or latent features from a Variational Autoencoder, is divided into multiple components that each capture distinct spatial levels. The low-resolution component contains the primary informative signal, while higher-resolution components add high-frequency details, such as texture. This approach divides image generation into two stages: producing a low-resolution base signal, followed by a high-resolution residual signal. Both stages can be effectively modeled using simpler, lightweight transformer architectures compared to full-resolution generation. This decomposition is conceptually similar to wavelet decomposition but offers a more streamlined and intuitive design. Our method, termed MSF (short for Multi-Scale Factorization), achieves an FID of 2.2 and an IS of 255.4 on the ImageNet 256x256 benchmark, reducing computational costs by 50% compared to baseline methods.
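The base-plus-residual split described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say which downsampling or upsampling operators MSF uses, so average pooling and nearest-neighbour upsampling are assumptions here, and the function names (`downsample`, `upsample`, `factorize`, `reconstruct`) and the 32x32x4 latent shape are hypothetical. The sketch covers only the decomposition and its exact inversion, not the two diffusion stages that would generate each component.

```python
import numpy as np


def downsample(x, factor=2):
    """Average-pool an (H, W, C) array by `factor` along both spatial axes (assumed operator)."""
    h, w, c = x.shape
    return x.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))


def upsample(x, factor=2):
    """Nearest-neighbour upsampling of an (H, W, C) array (assumed operator)."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)


def factorize(x, factor=2):
    """Split x into a low-resolution base and a full-resolution residual."""
    base = downsample(x, factor)
    residual = x - upsample(base, factor)
    return base, residual


def reconstruct(base, residual, factor=2):
    """Invert `factorize`: upsample the base and add the residual back."""
    return upsample(base, factor) + residual


# Hypothetical stand-in for a VAE latent; the shape is chosen for illustration only.
rng = np.random.default_rng(0)
latent = rng.standard_normal((32, 32, 4))

base, residual = factorize(latent)
recon = reconstruct(base, residual)
```

With these particular operators the residual averages to zero over every pooling block, so the base carries all of the block-level mean information and the residual carries only the finer detail, which matches the division of labour the abstract describes.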
Related papers
- One Diffusion Step to Real-World Super-Resolution via Flow Trajectory Distillation [60.54811860967658]
FluxSR is a novel one-step diffusion Real-ISR based on flow matching models.
First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR.
Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss.
arXiv Detail & Related papers (2025-02-04T04:11:29Z) - Effective Diffusion Transformer Architecture for Image Super-Resolution [63.254644431016345]
We design an effective diffusion transformer for image super-resolution (DiT-SR).
In practice, DiT-SR leverages an overall U-shaped architecture, and adopts a uniform isotropic design for all the transformer blocks.
We analyze the limitation of the widely used AdaLN, and present a frequency-adaptive time-step conditioning module.
arXiv Detail & Related papers (2024-09-29T07:14:16Z) - Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization [26.926712014346432]
This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization.
Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, setting new state-of-the-art FID scores of 1.70 on ImageNet 256 x 256 and 2.89 on ImageNet 512 x 512.
arXiv Detail & Related papers (2024-06-13T17:59:58Z) - Diffusion-Aided Joint Source Channel Coding For High Realism Wireless Image Transmission [24.372996233209854]
DiffJSCC is a novel framework that produces high-realism images via the conditional diffusion denoising process.
It can achieve highly realistic reconstructions for 768x512 pixel Kodak images with only 3072 symbols.
arXiv Detail & Related papers (2024-04-27T00:12:13Z) - Improving Pixel-based MIM by Reducing Wasted Modeling Capability [77.99468514275185]
We propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction.
To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures.
Our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation.
arXiv Detail & Related papers (2023-08-01T03:44:56Z) - Spatial-Frequency U-Net for Denoising Diffusion Probabilistic Models [89.76587063609806]
We study the denoising diffusion probabilistic model (DDPM) in wavelet space, instead of pixel space, for visual synthesis.
By explicitly modeling the wavelet signals, we find our model is able to generate images with higher quality on several datasets.
arXiv Detail & Related papers (2023-07-27T06:53:16Z) - Dimensionality-Varying Diffusion Process [52.52681373641533]
Diffusion models learn to reverse a signal destruction process to generate new data.
We make a theoretical generalization of the forward diffusion process via signal decomposition.
We show that our strategy facilitates high-resolution image synthesis and improves FID of diffusion model trained on FFHQ at $1024\times1024$ resolution from 52.40 to 10.46.
arXiv Detail & Related papers (2022-11-29T09:05:55Z) - Multi-scale frequency separation network for image deblurring [10.511076996096117]
We present a new method called multi-scale frequency separation network (MSFS-Net) for image deblurring.
MSFS-Net captures the low and high-frequency information of image at multiple scales.
Experiments on benchmark datasets show that the proposed network achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-06-01T23:48:35Z) - TWIST-GAN: Towards Wavelet Transform and Transferred GAN for Spatio-Temporal Single Image Super Resolution [4.622977798361014]
Single Image Super-resolution (SISR) produces high-resolution images with fine spatial resolutions from a remotely sensed image with low spatial resolution.
Deep learning and generative adversarial networks (GANs) have made breakthroughs for the challenging task of single image super-resolution (SISR).
arXiv Detail & Related papers (2021-04-20T22:12:38Z) - Modulated Periodic Activations for Generalizable Local Functional Representations [113.64179351957888]
We present a new representation that generalizes to multiple instances and achieves state-of-the-art fidelity.
Our approach produces general functional representations of images, videos and shapes, and achieves higher reconstruction quality than prior works that are optimized for a single signal.
arXiv Detail & Related papers (2021-04-08T17:59:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.