Related papers: DemoFusion: Democratising High-Resolution Image Generation With No $$$

DemoFusion: Democratising High-Resolution Image Generation With No $$$

URL: http://arxiv.org/abs/2311.16973v2
Date: Fri, 15 Dec 2023 02:15:29 GMT
Title: DemoFusion: Democratising High-Resolution Image Generation With No $$$
Authors: Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, Zhanyu Ma
Abstract summary: High-resolution image generation with Generative Artificial Intelligence (GenAI) has immense potential but, due to the enormous capital investment required for training, it is increasingly centralised to a few large corporations. This paper aims to democratise high-resolution GenAI by advancing the frontier of high-resolution generation while remaining accessible to a broad audience.
Score: 75.38688090593867
License: http://creativecommons.org/licenses/by/4.0/
Abstract: High-resolution image generation with Generative Artificial Intelligence (GenAI) has immense potential but, due to the enormous capital investment required for training, it is increasingly centralised to a few large corporations, and hidden behind paywalls. This paper aims to democratise high-resolution GenAI by advancing the frontier of high-resolution generation while remaining accessible to a broad audience. We demonstrate that existing Latent Diffusion Models (LDMs) possess untapped potential for higher-resolution image generation. Our novel DemoFusion framework seamlessly extends open-source GenAI models, employing Progressive Upscaling, Skip Residual, and Dilated Sampling mechanisms to achieve higher-resolution image generation. The progressive nature of DemoFusion requires more passes, but the intermediate results can serve as "previews", facilitating rapid prompt iteration.

Related papers

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model [87.23753533733046]
We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities.<n>Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder.
arXiv Detail & Related papers (2025-05-29T16:15:48Z)
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach.<n>Mogoo integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance.<n>Experiments show that Mogao achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z)
Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation [25.29877217341663]
We introduce R0, a novel conditional generation approach via regularized reward. We train state-of-the-art few-step text-to-image generative models with R0 at scales. Our results challenge the conventional wisdom of diffusion post-training and conditional generation.
arXiv Detail & Related papers (2025-03-17T11:21:43Z)
Dual Diffusion for Unified Image Generation and Understanding [32.7554623473768]
We propose a large-scale and fully end-to-end diffusion model for multi-modal understanding and generation. We leverage a cross-modal maximum likelihood estimation framework that simultaneously trains the conditional likelihoods of both images and text jointly. Our model attained competitive performance compared to recent unified image understanding and generation models.
arXiv Detail & Related papers (2024-12-31T05:49:00Z)
Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective [52.778766190479374]
Latent-based image generative models have achieved notable success in image generation tasks. Despite sharing the same latent space, autoregressive models significantly lag behind LDMs and MIMs in image generation. We propose a simple but effective discrete image tokenizer to stabilize the latent space for image generative modeling.
arXiv Detail & Related papers (2024-10-16T12:13:17Z)
Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which non-autoregressive masked image modeling (MIM) text-to-image elevates to a level comparable with state-of-the-art diffusion models like SDXL. We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution. Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z)
MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning [38.560064789022704]
MegaFusion extends existing diffusion-based text-to-image models towards efficient higher-resolution generation. We employ an innovative truncate and relay strategy to bridge the denoising processes across different resolutions. By integrating dilated convolutions and noise re-scheduling, we further adapt the model's priors for higher resolution.
arXiv Detail & Related papers (2024-08-20T16:53:34Z)
DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance [11.44012694656102]
Large-scale generative models, such as text-to-image diffusion models, have garnered widespread attention across diverse domains. Existing large-scale diffusion models are confined to generating images of up to 1K resolution. We propose a novel progressive approach that fully utilizes generated low-resolution images to guide the generation of higher-resolution images.
arXiv Detail & Related papers (2024-06-26T16:10:31Z)
DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis [56.849285913695184]
Diffusion Mamba (DiM) is a sequence model for efficient high-resolution image synthesis. DiM architecture achieves inference-time efficiency for high-resolution images. Experiments demonstrate the effectiveness and efficiency of our DiM.
arXiv Detail & Related papers (2024-05-23T06:53:18Z)
On the Challenges and Opportunities in Generative AI [135.2754367149689]
We argue that current large-scale generative AI models do not sufficiently address several fundamental issues that hinder their widespread adoption across domains. In this work, we aim to identify key unresolved challenges in modern generative AI paradigms that should be tackled to further enhance their capabilities, versatility, and reliability.
arXiv Detail & Related papers (2024-02-28T15:19:33Z)
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation [112.08287900261898]
This paper proposes a novel self-cascade diffusion model for rapid adaptation to higher-resolution image and video generation. Our approach achieves a 5X training speed-up and requires only an additional 0.002M tuning parameters. Experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
arXiv Detail & Related papers (2024-02-16T07:48:35Z)
RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model [93.8067369210696]
Text-to-image generation (TTI) refers to the usage of models that could process text input and generate high fidelity images based on text descriptions. Diffusion models are one prominent type of generative model used for the generation of images through the systematic introduction of noises with repeating steps. In the era of large models, scaling up model size and the integration with large language models have further improved the performance of TTI models.
arXiv Detail & Related papers (2023-09-02T03:27:20Z)
A Bayesian Non-parametric Approach to Generative Models: Integrating Variational Autoencoder and Generative Adversarial Networks using Wasserstein and Maximum Mean Discrepancy [2.966338139852619]
Generative adversarial networks (GANs) and variational autoencoders (VAEs) are two of the most prominent and widely studied generative models. We employ a Bayesian non-parametric (BNP) approach to merge GANs and VAEs. By fusing the discriminative power of GANs with the reconstruction capabilities of VAEs, our novel model achieves superior performance in various generative tasks.
arXiv Detail & Related papers (2023-08-27T08:58:31Z)
High-Resolution Image Synthesis with Latent Diffusion Models [14.786952412297808]
Training diffusion models on autoencoders allows for the first time to reach a near-optimal point between complexity reduction and detail preservation. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks.
arXiv Detail & Related papers (2021-12-20T18:55:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.