Laminating Representation Autoencoders for Efficient Diffusion
- URL: http://arxiv.org/abs/2602.04873v1
- Date: Wed, 04 Feb 2026 18:57:33 GMT
- Title: Laminating Representation Autoencoders for Efficient Diffusion
- Authors: Ramón Calvo-González, François Fleuret
- Abstract summary: Recent work has shown that diffusion models can generate high-quality images by operating directly on SSL patch features rather than pixel-space latents. We introduce FlatDINO, a variational autoencoder that compresses this representation into a one-dimensional sequence of just 32 continuous tokens.
- Score: 18.989001805139573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has shown that diffusion models can generate high-quality images by operating directly on SSL patch features rather than pixel-space latents. However, the dense patch grids from encoders like DINOv2 contain significant redundancy, making diffusion needlessly expensive. We introduce FlatDINO, a variational autoencoder that compresses this representation into a one-dimensional sequence of just 32 continuous tokens, an 8x reduction in sequence length and 48x compression in total dimensionality. On ImageNet 256x256, a DiT-XL trained on FlatDINO latents achieves a gFID of 1.80 with classifier-free guidance while requiring 8x fewer FLOPs per forward pass and up to 4.5x fewer FLOPs per training step compared to diffusion on uncompressed DINOv2 features. These are preliminary results and this work is in progress.
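The compression figures in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: the 32 tokens, 8x, and 48x figures come from the abstract, while the per-patch feature width of 768 (the width of a DINOv2 ViT-B encoder) is an assumption, since the abstract does not say which DINOv2 variant is used. The function name `implied_shapes` is hypothetical.

```python
def implied_shapes(latent_tokens=32, seq_reduction=8,
                   total_compression=48, feature_width=768):
    """Back out the tensor shapes implied by the abstract's ratios.

    latent_tokens, seq_reduction and total_compression are quoted from the
    abstract. feature_width is an ASSUMPTION (DINOv2 ViT-B width); the
    abstract does not state it.
    """
    # 8x shorter sequence -> the uncompressed patch grid has 8 * 32 tokens.
    patch_tokens = latent_tokens * seq_reduction

    # Total number of values in the uncompressed representation.
    uncompressed_values = patch_tokens * feature_width

    # 48x compression in total dimensionality.
    latent_values, remainder = divmod(uncompressed_values, total_compression)
    if remainder:
        raise ValueError("assumed feature_width is inconsistent with the ratio")

    # Spread the remaining values evenly over the latent tokens.
    latent_width, remainder = divmod(latent_values, latent_tokens)
    if remainder:
        raise ValueError("latent values do not divide evenly across tokens")

    return {
        "patch_tokens": patch_tokens,
        "uncompressed_values": uncompressed_values,
        "latent_values": latent_values,
        "latent_width": latent_width,
    }


shapes = implied_shapes()
print(shapes)
```

Under that assumption, the abstract's ratios imply 256 patch tokens before compression and 128 channels per latent token. A different encoder width would change the second number but not the first.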
Related papers
- YODA: Yet Another One-step Diffusion-based Video Compressor [55.356234617448905]
While one-step diffusion models have recently excelled in perceptual image compression, their application to video remains limited. We present YODA (Yet Another One-step Diffusion-based Video Compressor), which embeds multiscale features from temporal references for both latent generation and latent coding to better exploit spatial correlations. YODA achieves state-of-the-art perceptual performance, consistently outperforming deep-learning baselines on LPIPS, DISTS, FID, and KID.
arXiv Detail & Related papers (2026-01-03T10:12:07Z)
- DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation [93.6273078684831]
We propose a frequency-DeCoupled pixel diffusion framework to pursue a more efficient pixel diffusion paradigm. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance. Experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet.
arXiv Detail & Related papers (2025-11-24T17:59:06Z)
- DiP: Taming Diffusion Models in Pixel Space [91.51011771517683]
A Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction. A co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details.
arXiv Detail & Related papers (2025-11-24T06:55:49Z)
- One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models [45.92038137978053]
We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines.
arXiv Detail & Related papers (2025-11-13T18:54:18Z)
- StableCodec: Taming One-Step Diffusion for Extreme Image Compression [19.69733852050049]
Diffusion-based image compression has shown remarkable potential for achieving ultra-low coding (less than 0.05 bits per pixel) with high realism. Current approaches require a large number of denoising steps at the decoder to generate realistic results under extreme constraints. We introduce StableCodec, which enables one-step diffusion for high-fidelity and high-realism extreme image compression.
arXiv Detail & Related papers (2025-06-27T07:39:21Z)
- DDT: Decoupled Diffusion Transformer [51.84206763079382]
Diffusion transformers encode noisy inputs to extract the semantic component and decode higher frequencies with identical modules. The paper introduces the Decoupled Diffusion Transformer (DDT).
arXiv Detail & Related papers (2025-04-08T07:17:45Z)
- Representing 3D Shapes With 64 Latent Vectors for 3D Diffusion Models [21.97308739556984]
COD-VAE encodes 3D shapes into a COmpact set of 1D latent vectors. We show that COD-VAE achieves 16x compression compared to the baseline while maintaining quality.
arXiv Detail & Related papers (2025-03-11T06:29:39Z)
- An Image is Worth 32 Tokens for Reconstruction and Generation [54.24414696392026]
Transformer-based 1-Dimensional Tokenizer (TiTok) is an innovative approach that tokenizes images into 1D latent sequences.
TiTok achieves competitive performance to state-of-the-art approaches.
Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while still generating high-quality samples 74x faster.
arXiv Detail & Related papers (2024-06-11T17:59:56Z)
- PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher [55.22994720855957]
PaGoDA is a novel pipeline that reduces the training costs through three stages: training diffusion on downsampled data, distilling the pretrained diffusion, and progressive super-resolution.
With the proposed pipeline, PaGoDA achieves a $64\times$ reduced cost in training its diffusion model on 8x downsampled data.
PaGoDA's pipeline can be applied directly in the latent space, adding compression alongside the pre-trained autoencoder in Latent Diffusion Models.
arXiv Detail & Related papers (2024-05-23T17:39:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.