Make a Cheap Scaling: A Self-Cascade Diffusion Model for
Higher-Resolution Adaptation
- URL: http://arxiv.org/abs/2402.10491v1
- Date: Fri, 16 Feb 2024 07:48:35 GMT
- Title: Make a Cheap Scaling: A Self-Cascade Diffusion Model for
Higher-Resolution Adaptation
- Authors: Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun,
Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan,
Bihan Wen
- Abstract summary: This paper proposes a novel self-cascade diffusion model for rapid adaptation to higher-resolution image and video generation.
Our approach achieves a 5X training speed-up and requires only an additional 0.002M tuning parameters.
Experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
- Score: 112.08287900261898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have proven to be highly effective in image and video
generation; however, they still face composition challenges when generating
images of varying sizes due to single-scale training data. Adapting large
pre-trained diffusion models for higher resolution demands substantial
computational and optimization resources, yet achieving a generation capability
comparable to low-resolution models remains elusive. This paper proposes a
novel self-cascade diffusion model that leverages the rich knowledge gained
from a well-trained low-resolution model for rapid adaptation to
higher-resolution image and video generation, employing either tuning-free or
cheap upsampler tuning paradigms. Integrating a sequence of multi-scale
upsampler modules, the self-cascade diffusion model can efficiently adapt to a
higher resolution, preserving the original composition and generation
capabilities. We further propose a pivot-guided noise re-schedule strategy to
speed up the inference process and improve local structural details. Compared
to full fine-tuning, our approach achieves a 5X training speed-up and requires
only an additional 0.002M tuning parameters. Extensive experiments demonstrate
that our approach can quickly adapt to higher resolution image and video
synthesis by fine-tuning for just 10k steps, with virtually no additional
inference time.
Related papers
- FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion [50.43304425256732]
FreeScale is a tuning-free inference paradigm to enable higher-resolution visual generation via scale fusion.
We extend the capabilities of higher-resolution visual generation for both image and video models.
arXiv Detail & Related papers (2024-12-12T18:59:59Z) - Accelerating Video Diffusion Models via Distribution Matching [26.475459912686986]
This work introduces a novel framework for diffusion distillation and distribution matching.
Our approach focuses on distilling pre-trained diffusion models into a more efficient few-step generator.
By leveraging a combination of video GAN loss and a novel 2D score distribution matching loss, we demonstrate the potential to generate high-quality video frames.
arXiv Detail & Related papers (2024-12-08T11:36:32Z) - Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image to a level comparable with state-of-the-art diffusion models like SDXL.
We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers.
Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z) - MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning [38.560064789022704]
MegaFusion extends existing diffusion-based text-to-image models towards efficient higher-resolution generation.
We employ an innovative truncate and relay strategy to bridge the denoising processes across different resolutions.
By integrating dilated convolutions and noise re-scheduling, we further adapt the model's priors for higher resolution.
arXiv Detail & Related papers (2024-08-20T16:53:34Z) - DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance [11.44012694656102]
Large-scale generative models, such as text-to-image diffusion models, have garnered widespread attention across diverse domains.
Existing large-scale diffusion models are confined to generating images of up to 1K resolution.
We propose a novel progressive approach that fully utilizes generated low-resolution images to guide the generation of higher-resolution images.
arXiv Detail & Related papers (2024-06-26T16:10:31Z) - Hierarchical Patch Diffusion Models for High-Resolution Video Generation [50.42746357450949]
We develop deep context fusion, which propagates context information from low-scale to high-scale patches in a hierarchical manner.
We also propose adaptive computation, which allocates more network capacity and computation towards coarse image details.
The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation.
arXiv Detail & Related papers (2024-06-12T01:12:53Z) - Upsample Guidance: Scale Up Diffusion Models without Training [0.0]
We introduce upsample guidance, a technique that adapts pretrained diffusion model to generate higher-resolution images.
Remarkably, this technique does not necessitate any additional training or relying on external models.
We demonstrate that upsample guidance can be applied to various models, such as pixel-space, latent space, and video diffusion models.
arXiv Detail & Related papers (2024-04-02T07:49:08Z) - FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis [48.9652334528436]
We introduce an innovative, training-free approach FouriScale from the perspective of frequency domain analysis.
We replace the original convolutional layers in pre-trained diffusion models by incorporating a dilation technique along with a low-pass operation.
Our method successfully balances the structural integrity and fidelity of generated images, achieving an astonishing capacity of arbitrary-size, high-resolution, and high-quality generation.
arXiv Detail & Related papers (2024-03-19T17:59:33Z) - ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with
Diffusion Models [126.35334860896373]
We investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes.
Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues.
We propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference.
arXiv Detail & Related papers (2023-10-11T17:52:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.