Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models
- URL: http://arxiv.org/abs/2503.18352v2
- Date: Fri, 28 Mar 2025 04:51:44 GMT
- Title: Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models
- Authors: Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, Di Huang
- Abstract summary: Diffusion-4K is a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. We construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We propose a wavelet-based fine-tuning approach for direct training with 4K images, applicable to various latent diffusion models.
- Score: 21.46605047406198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. The core advancements include: (1) Aesthetic-4K Benchmark: addressing the absence of a publicly available 4K image synthesis dataset, we construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We curated a high-quality 4K dataset with carefully selected images and captions generated by GPT-4o. Additionally, we introduce GLCM Score and Compression Ratio metrics to evaluate fine details, combined with holistic measures such as FID, Aesthetics and CLIPScore for a comprehensive assessment of ultra-high-resolution images. (2) Wavelet-based Fine-tuning: we propose a wavelet-based fine-tuning approach for direct training with photorealistic 4K images, applicable to various latent diffusion models, demonstrating its effectiveness in synthesizing highly detailed 4K images. Consequently, Diffusion-4K achieves impressive performance in high-quality image synthesis and text prompt adherence, especially when powered by modern large-scale diffusion models (e.g., SD3-2B and Flux-12B). Extensive experimental results from our benchmark demonstrate the superiority of Diffusion-4K in ultra-high-resolution image synthesis.
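To make the proposed fine-detail metrics concrete, here is a minimal sketch of how a GLCM-based texture score and a compression-ratio proxy could be computed with scikit-image and Pillow. The exact formulations behind the paper's GLCM Score and Compression Ratio may differ; the patch size, GLCM property, and PNG codec below are illustrative assumptions.

```python
# Illustrative fine-detail metrics; the paper's exact GLCM Score and
# Compression Ratio definitions may differ from this sketch.
import io

import numpy as np
from PIL import Image
from skimage.feature import graycomatrix, graycoprops

def glcm_score(gray_patch: np.ndarray) -> float:
    """GLCM contrast of an 8-bit grayscale patch; higher means richer texture."""
    glcm = graycomatrix(gray_patch, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    return float(graycoprops(glcm, "contrast").mean())

def compression_ratio(img: Image.Image) -> float:
    """Raw byte count over losslessly compressed size; detail compresses poorly."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    raw_bytes = img.width * img.height * len(img.getbands())
    return raw_bytes / buf.tell()

img = Image.open("sample_4k.png")  # hypothetical 4K sample image
gray = np.asarray(img.convert("L"))
print(glcm_score(gray[:512, :512]), compression_ratio(img))
```

The wavelet-based fine-tuning can likewise be sketched as a loss that separates predictions into low- and high-frequency sub-bands and up-weights the latter, assuming a single-level Haar transform; the paper's actual objective and wavelet choice may differ.

```python
# Sketch of a wavelet-weighted reconstruction loss; the single-level Haar
# transform and the weighting scheme are assumptions, not the paper's objective.
import torch
import torch.nn.functional as F

def haar_dwt(x: torch.Tensor):
    """Single-level orthonormal Haar DWT of (B, C, H, W) -> (LL, LH, HL, HH)."""
    a = x[..., 0::2, 0::2]  # even rows, even cols
    b = x[..., 0::2, 1::2]  # even rows, odd cols
    c = x[..., 1::2, 0::2]  # odd rows, even cols
    d = x[..., 1::2, 1::2]  # odd rows, odd cols
    return ((a + b + c + d) / 2,  # LL: low-frequency approximation
            (a - b + c - d) / 2,  # detail sub-band
            (a + b - c - d) / 2,  # detail sub-band
            (a - b - c + d) / 2)  # detail sub-band

def wavelet_loss(pred: torch.Tensor, target: torch.Tensor,
                 hf_weight: float = 2.0) -> torch.Tensor:
    """MSE on the low-frequency band plus up-weighted high-frequency bands."""
    p, t = haar_dwt(pred), haar_dwt(target)
    low = F.mse_loss(p[0], t[0])
    high = sum(F.mse_loss(pi, ti) for pi, ti in zip(p[1:], t[1:]))
    return low + hf_weight * high
```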
Related papers
- Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency [49.875459658889355]
Free4D is a tuning-free framework for 4D scene generation from a single image.
Our key insight is to distill pre-trained foundation models for consistent 4D scene representation.
The resulting 4D representation enables real-time, controllable rendering.
arXiv Detail & Related papers (2025-03-26T17:59:44Z)
- IV-Mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis [22.79121512759783]
IV-Mixed Sampler is a novel training-free algorithm for video diffusion models.
It uses image diffusion models (IDMs) to enhance the quality of each video frame and video diffusion models (VDMs) to ensure the temporal coherence of the video during the sampling process.
It achieves state-of-the-art performance on four benchmarks including UCF-101-FVD, MSR-VTT-FVD, Chronomagic-Bench-150, and Chronomagic-Bench-1649.
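The interleaving described above can be sketched as one reverse-diffusion loop that alternates per-frame updates from an image diffusion model with joint updates from a video diffusion model. The `idm_step` and `vdm_step` interfaces below are hypothetical placeholders, not the published sampler:

```python
# Hypothetical sketch of interleaved image/video denoising; the actual
# IV-Mixed Sampler schedule and model interfaces are more involved.
import torch

def iv_mixed_sample(idm_step, vdm_step, x: torch.Tensor, timesteps):
    """x: (frames, C, H, W) noisy video latents, denoised toward t = 0."""
    for t in timesteps:  # e.g., from high to low noise levels
        # Per-frame refinement with the image diffusion model (visual quality).
        x = torch.stack([idm_step(frame, t) for frame in x])
        # Joint update with the video diffusion model (temporal coherence).
        x = vdm_step(x, t)
    return x
```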
arXiv Detail & Related papers (2024-10-05T14:33:28Z)
- 4K4D: Real-Time 4D View Synthesis at 4K Resolution [86.6582179227016]
This paper targets high-fidelity, real-time view synthesis of dynamic 3D scenes at 4K resolution.
We propose a 4D point cloud representation that supports hardware rasterization and enables unprecedented rendering speed.
Our representation can be rendered at over 400 FPS on the DNA-Rendering dataset at 1080p resolution and 80 FPS on the ENeRF-Outdoor dataset at 4K resolution using an RTX 4090 GPU.
arXiv Detail & Related papers (2023-10-17T17:57:38Z)
- ACDMSR: Accelerated Conditional Diffusion Models for Single Image Super-Resolution [84.73658185158222]
We propose a diffusion model-based super-resolution method called ACDMSR.
Our method adapts the standard diffusion model to perform super-resolution through a deterministic iterative denoising process.
Our approach generates more visually realistic high-resolution counterparts of low-resolution images, demonstrating its effectiveness in practical scenarios.
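The deterministic iterative denoising can be sketched as a DDIM-style loop (eta = 0) conditioned on an upsampled low-resolution input; the `eps_model` signature and the x4 scale below are assumptions, not ACDMSR's exact procedure:

```python
# Illustrative deterministic (DDIM-style) conditional SR loop; not ACDMSR's
# exact method, and eps_model's signature is a hypothetical placeholder.
import torch
import torch.nn.functional as F

@torch.no_grad()
def conditional_sr_sample(eps_model, lr_img, alphas_cumprod, steps):
    """eps_model(x_t, cond, t) predicts the noise in x_t given the LR condition."""
    cond = F.interpolate(lr_img, scale_factor=4, mode="bicubic")
    x = torch.randn_like(cond)  # start from pure noise at the HR size
    for i in range(len(steps) - 1):
        t, t_prev = steps[i], steps[i + 1]
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        eps = eps_model(x, cond, t)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean image
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # eta = 0: deterministic
    return x
```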
arXiv Detail & Related papers (2023-07-03T06:49:04Z)
- Probabilistic-based Feature Embedding of 4-D Light Fields for Compressive Imaging and Denoising [62.347491141163225]
The 4-D light field (LF) poses great challenges for efficient and effective feature embedding.
We propose a probabilistic-based feature embedding (PFE), which learns a feature embedding architecture by assembling various low-dimensional convolution patterns.
Our experiments demonstrate the significant superiority of our methods on both real-world and synthetic 4-D LF images.
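One common way to realize "low-dimensional convolution patterns" on 4-D LF data is to alternate 2-D convolutions over the spatial and angular axes instead of using a full 4-D kernel. The sketch below shows that generic building block as an assumption; it is not PFE's specific architecture:

```python
# Generic spatial-angular separable convolution for light fields; one example
# of a low-dimensional convolution pattern, not PFE's actual design.
import torch
import torch.nn as nn

class SpatialAngularConv(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
        self.angular = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, U, V, H, W) with angular (U, V) and spatial (H, W) axes.
        b, c, u, v, h, w = x.shape
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b * u * v, c, h, w)
        x = self.spatial(x)  # 2-D convolution over the spatial plane
        x = x.reshape(b, u, v, c, h, w).permute(0, 4, 5, 3, 1, 2)
        x = self.angular(x.reshape(b * h * w, c, u, v))  # over the angular plane
        return x.reshape(b, h, w, c, u, v).permute(0, 3, 4, 5, 1, 2)
```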
arXiv Detail & Related papers (2023-06-15T03:46:40Z)
- 4K-HAZE: A Dehazing Benchmark with 4K Resolution Hazy and Haze-Free Images [12.402054374952485]
We develop a novel method to simulate 4K hazy images from clear ones: it first estimates the scene depth, simulates the light rays and object reflectance, and then migrates the synthetic images to the real domain using a GAN.
We wrap these synthesized images into a benchmark called the 4K-HAZE dataset.
The most appealing aspect of our approach is its ability to process a 4K image on a single GPU with 24 GB of memory in real time (33 fps).
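Depth-based haze simulation of this kind is conventionally built on the atmospheric scattering model I = J*t + A*(1 - t) with transmission t = exp(-beta * d). The sketch below implements that standard model; the paper's full pipeline, including the GAN-based domain migration, is more elaborate:

```python
# Standard atmospheric scattering model for synthesizing haze; the 4K-HAZE
# pipeline additionally migrates such images to the real domain with a GAN.
import numpy as np

def add_haze(clear: np.ndarray, depth: np.ndarray,
             beta: float = 1.2, airlight: float = 0.9) -> np.ndarray:
    """clear: (H, W, 3) image in [0, 1]; depth: (H, W) scene depth map."""
    t = np.exp(-beta * depth)[..., None]     # transmission falls with depth
    hazy = clear * t + airlight * (1.0 - t)  # I = J*t + A*(1 - t)
    return np.clip(hazy, 0.0, 1.0)
```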
arXiv Detail & Related papers (2023-03-28T09:39:29Z)
- 4K-NeRF: High Fidelity Neural Radiance Fields at Ultra High Resolutions [19.380248980850727]
We present a novel and effective framework, named 4K-NeRF, to pursue high-fidelity view synthesis in the challenging scenario of ultra-high resolutions.
We address the issue by exploring ray correlation to enhance the recovery of high-frequency details.
Our method can significantly boost rendering quality of high-frequency details compared with modern NeRF methods, and achieves state-of-the-art visual quality in 4K ultra-high-resolution scenarios.
arXiv Detail & Related papers (2022-12-09T07:26:49Z)
- Towards Efficient and Scale-Robust Ultra-High-Definition Image Demoireing [71.62289021118983]
We present an efficient baseline model, ESDNet, for tackling 4K moiré images, wherein we build a semantic-aligned scale-aware module to address the scale variation of moiré patterns.
Our approach outperforms state-of-the-art methods by a large margin while being much more lightweight.
arXiv Detail & Related papers (2022-07-20T14:20:52Z)
- DiVAE: Photorealistic Images Synthesis with Denoising Diffusion Decoder [73.1010640692609]
We propose a VQ-VAE architecture model with a diffusion decoder (DiVAE) to work as the reconstructing component in image synthesis.
Our model achieves state-of-the-art results and, in particular, generates more photorealistic images.
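As summarized, the design replaces the VQ-VAE's feed-forward decoder with a diffusion model conditioned on the quantized codes. A schematic sketch of that wiring follows; every module and interface here is a placeholder assumption, not the paper's implementation:

```python
# Schematic VQ encoder + diffusion decoder; all modules are placeholder
# interfaces, not DiVAE's actual architecture.
import torch
import torch.nn as nn

class DiffusionDecoderVAE(nn.Module):
    def __init__(self, encoder: nn.Module, quantizer: nn.Module,
                 denoiser: nn.Module, sampler):
        super().__init__()
        self.encoder, self.quantizer = encoder, quantizer
        self.denoiser = denoiser  # noise-prediction network, code-conditioned
        self.sampler = sampler    # reverse-diffusion loop, e.g. DDPM/DDIM

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        codes = self.quantizer(self.encoder(img))  # discrete latent codes
        noise = torch.randn_like(img)
        # Decode by denoising pure noise, conditioned on the codes.
        return self.sampler(self.denoiser, noise, cond=codes)
```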
arXiv Detail & Related papers (2022-06-01T10:39:12Z)
- One-shot Ultra-high-Resolution Generative Adversarial Network That Synthesizes 16K Images On A Single GPU [1.9060575156739825]
OUR-GAN is a one-shot generative adversarial network framework that generates non-repetitive 16K images from a single training image.
OUR-GAN can synthesize high-quality 16K images with 12.5 GB of GPU memory and 4K images with only 4.29 GB.
OUR-GAN is the first one-shot image synthesizer that generates non-repetitive UHR images on a single consumer GPU.
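Fitting 16K synthesis into 12.5 GB implies the full image is never materialized on the GPU at once. One plausible scheme, assumed here rather than taken from OUR-GAN's paper, is tiled synthesis over overlapping subregions:

```python
# Generic tiled upsampling to bound GPU memory; illustrative only, not
# OUR-GAN's exact subregion scheme.
import torch

@torch.no_grad()
def tiled_upsample(model, lowres: torch.Tensor, tile: int = 512,
                   overlap: int = 32, scale: int = 4, device: str = "cuda"):
    """Run `model` (a hypothetical x`scale` upsampler) one subregion at a time."""
    _, c, h, w = lowres.shape
    out = torch.zeros(1, c, h * scale, w * scale)  # assembled on the CPU
    step = tile - overlap
    for y in range(0, h, step):
        for x in range(0, w, step):
            patch = lowres[:, :, y:y + tile, x:x + tile].to(device)
            up = model(patch).cpu()  # only one tile lives on the GPU at a time
            out[:, :, y * scale:y * scale + up.shape[2],
                x * scale:x * scale + up.shape[3]] = up
    return out
```

A seamless variant would blend the overlapping borders instead of overwriting them.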
arXiv Detail & Related papers (2022-02-28T13:48:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.