One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models
- URL: http://arxiv.org/abs/2511.10629v1
- Date: Fri, 14 Nov 2025 02:00:38 GMT
- Title: One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models
- Authors: Aleksandr Razin, Danil Kazantsev, Ilya Makarov
- Abstract summary: We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines.
- Score: 45.92038137978053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
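To make the abstract's "scale-specific pixel-shuffle heads" concrete, here is a minimal NumPy sketch of the pixel-shuffle (depth-to-space) rearrangement such a head would apply to latent feature maps. The function name and shapes are illustrative only and are not taken from the paper's code; a real head would first expand channels with a convolution before this rearrangement.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r^2, H, W) array into (C, H*r, W*r).

    This is the depth-to-space step used by scale-specific
    upsampling heads: channel groups of size r*r are redistributed
    into r x r spatial blocks, multiplying resolution by r.
    """
    c2, h, w = x.shape
    assert c2 % (r * r) == 0, "channels must be divisible by r^2"
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)        # split channels into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)      # -> (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)   # interleave blocks spatially

# Example: a 4-channel 2x2 "latent" becomes a 1-channel 4x4 map (r=2).
latent = np.arange(16, dtype=np.float32).reshape(4, 2, 2)
upscaled = pixel_shuffle(latent, r=2)
```

For a 2x head the backbone would output `C*4` channels, for a 4x head `C*16`; both heads can then share the same Swin-style feature extractor, matching the shared-backbone design the abstract describes.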
Related papers
- DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation [93.6273078684831]
We propose a frequency-DeCoupled pixel diffusion framework to pursue a more efficient pixel diffusion paradigm. With the intuition to decouple the generation of high- and low-frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance. Experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet.
arXiv Detail & Related papers (2025-11-24T17:59:06Z)
- DiP: Taming Diffusion Models in Pixel Space [91.51011771517683]
The Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction. A co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details.
arXiv Detail & Related papers (2025-11-24T06:55:49Z)
- Diffusion Transformers with Representation Autoencoders [35.43400861279246]
Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT). Most DiTs continue to rely on the original VAE encoder, which introduces several limitations. In this work, we explore replacing the VAE with pretrained representation encoders paired with trained decoders, forming what we term Representation Autoencoders (RAE).
arXiv Detail & Related papers (2025-10-13T17:51:39Z)
- High-resolution Photo Enhancement in Real-time: A Laplacian Pyramid Network [73.19214585791268]
This paper introduces a pyramid network called LLF-LUT++, which integrates global and local operators through closed-form Laplacian pyramid decomposition and reconstruction. Specifically, we utilize an image-adaptive 3D LUT that capitalizes on the global tonal characteristics of downsampled images. LLF-LUT++ not only achieves a 2.64 dB improvement in PSNR on the HDR+ dataset, but also reduces runtime, processing 4K-resolution images in just 13 ms on a single GPU.
arXiv Detail & Related papers (2025-10-13T16:52:32Z)
- SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization [56.12853087022071]
We introduce a new pixel diffusion decoder architecture for improved scaling and training stability. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction trained without adversarial losses.
arXiv Detail & Related papers (2025-10-06T15:57:31Z)
- InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis [51.81849724354083]
The computational demand of current diffusion models increases quadratically with resolution, pushing 4K image generation past 100 seconds. We propose to decode arbitrary-resolution images from a compact generated latent using a one-step generator. InfGen can extend many models to arbitrary high resolutions while cutting 4K image generation time to under 10 seconds.
arXiv Detail & Related papers (2025-09-12T17:48:57Z)
- Pixel to Gaussian: Ultra-Fast Continuous Super-Resolution with 2D Gaussian Modeling [50.34513854725803]
Arbitrary-scale super-resolution (ASSR) aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs with arbitrary upsampling factors. We propose a novel ContinuousSR framework with a Pixel-to-Gaussian paradigm, which explicitly reconstructs 2D continuous HR signals from LR images using Gaussian Splatting.
arXiv Detail & Related papers (2025-03-09T13:43:57Z)
- Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion [34.70370851239368]
We show that pixel-space models can be very competitive with latent models in both quality and efficiency. We present a simple recipe for scaling end-to-end pixel-space diffusion models to high resolutions.
arXiv Detail & Related papers (2024-10-25T06:20:06Z)
- LinFusion: 1 GPU, 1 Minute, 16K Image [71.44735417472043]
We introduce a low-rank approximation of a wide spectrum of popular linear token mixers.
We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD.
Experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation.
arXiv Detail & Related papers (2024-09-03T17:54:39Z)
- Image Super-resolution Via Latent Diffusion: A Sampling-space Mixture Of Experts And Frequency-augmented Decoder Approach [17.693287544860638]
Latent-based diffusion for image super-resolution is improved by pre-trained text-to-image models. Latent-based methods utilize a feature encoder to transform the image and then perform SR image generation in a compact latent space.
We propose a frequency compensation module that enhances the frequency components from latent space to pixel space.
arXiv Detail & Related papers (2023-10-18T14:39:25Z)