Related papers: Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images

URL: http://arxiv.org/abs/2308.16582v2
Date: Mon, 11 Sep 2023 07:44:49 GMT
Title: Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images
Authors: Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, Hang Xu
Abstract summary: Stable diffusion, a generative model used in text-to-image synthesis, frequently encounters composition problems when generating images of varying sizes. We propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size. We show that ASD can produce well-structured images of arbitrary sizes, cutting down the inference time by 2x compared to the traditional tiled algorithm.
Score: 56.17404812357676
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Stable diffusion, a generative model used in text-to-image synthesis, frequently encounters resolution-induced composition problems when generating images of varying sizes. This issue primarily stems from the model being trained on pairs of single-scale images and their corresponding text descriptions. Moreover, direct training on images of unlimited sizes is unfeasible, as it would require an immense number of text-image pairs and entail substantial computational expenses. To overcome these challenges, we propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size, while minimizing the need for high-memory GPU resources. Specifically, the initial stage, dubbed Any Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a restricted range of ratios to optimize the text-conditional diffusion model, thereby improving its ability to adjust composition to accommodate diverse image sizes. To support the creation of images at any desired size, we further introduce a technique called Fast Seamless Tiled Diffusion (FSTD) at the subsequent stage. This method allows for the rapid enlargement of the ASD output to any high-resolution size, avoiding seaming artifacts or memory overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks demonstrate that ASD can produce well-structured images of arbitrary sizes, cutting down the inference time by 2x compared to the traditional tiled algorithm.
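The abstract compares FSTD against "the traditional tiled algorithm" but does not spell out either procedure. As a rough illustration of the generic tiling idea only (not the authors' FSTD method), the Python sketch below lays out overlapping tiles that cover an image of any size. The function name `tile_boxes` and the default tile size of 512 and overlap of 64 are assumptions chosen for illustration and do not come from the paper.

```python
def tile_boxes(height, width, tile=512, overlap=64):
    """Return (top, left, bottom, right) boxes covering a height x width image.

    Hypothetical helper: tile and overlap defaults are illustrative assumptions.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap

    def starts(size):
        # A single tile suffices when the image is no larger than the tile.
        if size <= tile:
            return [0]
        # Regular grid of start offsets, plus one last tile flush with the far edge.
        positions = list(range(0, size - tile, stride))
        positions.append(size - tile)
        return positions

    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in starts(height)
        for left in starts(width)
    ]


if __name__ == "__main__":
    for box in tile_boxes(1024, 1024):
        print(box)
```

In a tiled pipeline, each box would be processed separately and the overlapping borders blended to hide seams; how FSTD avoids seams and reduces inference time is described in the paper itself, not in this sketch.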

Related papers

FlowTok: Flowing Seamlessly Across Text and Image Tokens [20.629139911638646]
FlowTok is a framework that seamlessly flows across text and images by encoding images into a compact 1D token representation. It reduces the latent space size by 3.3x at an image resolution of 256, eliminating the need for complex conditioning mechanisms or noise scheduling. With its streamlined architecture centered around compact 1D tokens, FlowTok is highly memory-efficient, requires significantly fewer training resources, and achieves much faster sampling speeds, all while delivering performance comparable to state-of-the-art models.
arXiv Detail & Related papers (2025-03-13T18:06:13Z)
ZoomLDM: Latent Diffusion Model for multi-scale image generation [57.639937071834986]
We present ZoomLDM, a diffusion model tailored for generating images across multiple scales. Central to our approach is a novel magnification-aware conditioning mechanism that utilizes self-supervised learning (SSL) embeddings. ZoomLDM achieves state-of-the-art image generation quality across all scales, excelling in the data-scarce setting of generating thumbnails of entire large images.
arXiv Detail & Related papers (2024-11-25T22:39:22Z)
One Diffusion to Generate Them All [54.82732533013014]
OneDiffusion is a versatile, large-scale diffusion model that supports bidirectional image synthesis and understanding. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps. OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs.
arXiv Detail & Related papers (2024-11-25T12:11:05Z)
Adapting Diffusion Models for Improved Prompt Compliance and Controllable Image Synthesis [43.481539150288434]
This work introduces a new family of Factor Graph Diffusion Models (FG-DMs). FG-DMs model the joint distribution of images and conditioning variables, such as semantic, sketch, depth, or normal maps, via a factor graph decomposition.
arXiv Detail & Related papers (2024-10-29T00:54:00Z)
High-Precision Dichotomous Image Segmentation via Probing Diffusion Capacity [69.32473738284374]
We propose DiffDIS, a diffusion-driven segmentation model that taps into the potential of the pre-trained U-Net within diffusion models. By leveraging the robust generalization capabilities and the rich, versatile image representation prior of the SD models, we significantly reduce the inference time while preserving high-fidelity, detailed generation. Experiments on the DIS5K dataset demonstrate the superiority of DiffDIS, achieving state-of-the-art results through a streamlined inference process.
arXiv Detail & Related papers (2024-10-14T02:49:23Z)
OrientDream: Streamlining Text-to-3D Generation with Explicit Orientation Control [66.03885917320189]
OrientDream is a camera orientation conditioned framework for efficient and multi-view consistent 3D generation from textual prompts. Our strategy emphasizes the implementation of an explicit camera orientation conditioned feature in the pre-training of a 2D text-to-image diffusion module. Our experiments reveal that our method not only produces high-quality NeRF models with consistent multi-view properties but also achieves an optimization speed significantly greater than existing methods.
arXiv Detail & Related papers (2024-06-14T13:16:18Z)
Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder [29.924160271522354]
Super-resolution (SR) and image generation are important tasks in computer vision and are widely adopted in real-world applications. Most existing methods, however, generate images only at fixed-scale magnification and suffer from over-smoothing and artifacts. The most relevant prior work applied Implicit Neural Representation (INR) to the denoising diffusion model to obtain continuous-resolution yet diverse and high-quality SR results. We propose a novel pipeline that can super-resolve an input image, or generate a novel image from random noise, at arbitrary scales.
arXiv Detail & Related papers (2024-03-15T12:45:40Z)
Collaborative Score Distillation for Consistent Visual Synthesis [70.29294250371312]
Collaborative Score Distillation (CSD) is based on Stein Variational Gradient Descent (SVGD). We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
arXiv Detail & Related papers (2023-07-04T17:31:50Z)
Raising The Limit Of Image Rescaling Using Auxiliary Encoding [7.9700865143145485]
Recently, image rescaling models like IRN utilize the bidirectional nature of INN to push the performance limit of image upscaling. We propose auxiliary encoding modules to further push the limit of image rescaling performance.
arXiv Detail & Related papers (2023-03-12T20:49:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.