Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization
- URL: http://arxiv.org/abs/2503.11056v2
- Date: Wed, 02 Apr 2025 18:40:41 GMT
- Title: Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization
- Authors: Kyle Sargent, Kyle Hsu, Justin Johnson, Li Fei-Fei, Jiajun Wu
- Abstract summary: FlowMo is a transformer-based diffusion autoencoder that achieves a new state-of-the-art for image tokenization at multiple compression rates. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage.
- Score: 28.089274647643716
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Since the advent of popular visual generation frameworks like VQGAN and latent diffusion models, state-of-the-art image generation systems have generally been two-stage systems that first tokenize or compress visual data into a lower-dimensional latent space before learning a generative model. Tokenizer training typically follows a standard recipe in which images are compressed and reconstructed subject to a combination of MSE, perceptual, and adversarial losses. Diffusion autoencoders have been proposed in prior work as a way to learn end-to-end perceptually-oriented image compression, but have not yet shown state-of-the-art performance on the competitive task of ImageNet-1K reconstruction. We propose FlowMo, a transformer-based diffusion autoencoder that achieves a new state-of-the-art for image tokenization at multiple compression rates without using convolutions, adversarial losses, spatially-aligned two-dimensional latent codes, or distilling from other tokenizers. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage. In addition, we conduct extensive analyses and explore the training of generative models atop the FlowMo tokenizer. Our code and models will be available at http://kylesargent.github.io/flowmo.
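The two-stage recipe in the abstract lends itself to a compact illustration. The sketch below is a hypothetical, heavily simplified rendering of that structure in PyTorch: a rectified-flow reconstruction loss for mode-matching pre-training, then a few-step sampling loss for mode-seeking post-training. All module names, shapes, and losses here are assumptions for illustration, not FlowMo's actual architecture.

```python
# Hypothetical sketch of a two-stage diffusion-autoencoder recipe: a
# mode-matching (flow-matching) pre-training stage, then a mode-seeking
# post-training stage. Names and shapes are illustrative, not FlowMo's API.
import torch
import torch.nn as nn

class DiffusionAutoencoder(nn.Module):
    def __init__(self, dim=64, latent_dim=16):
        super().__init__()
        self.encoder = nn.Linear(dim, latent_dim)
        # Velocity network: takes noisy input, time, and latent code.
        self.velocity = nn.Linear(dim + latent_dim + 1, dim)

    def forward(self, x_t, t, z):
        return self.velocity(torch.cat([x_t, z, t], dim=-1))

def flow_matching_loss(model, x):
    """Stage 1 (mode-matching): regress the rectified-flow velocity."""
    z = model.encoder(x)
    t = torch.rand(x.shape[0], 1)
    noise = torch.randn_like(x)
    x_t = (1 - t) * noise + t * x          # linear interpolation path
    target_v = x - noise                   # rectified-flow target velocity
    return ((model(x_t, t, z) - target_v) ** 2).mean()

def mode_seeking_loss(model, x, steps=4):
    """Stage 2 (mode-seeking): push few-step samples toward sharp
    reconstructions, e.g. via a perceptual distance (plain MSE here)."""
    z = model.encoder(x)
    x_t = torch.randn_like(x)
    for i in range(steps):                 # cheap unrolled Euler sampler
        t = torch.full((x.shape[0], 1), i / steps)
        x_t = x_t + model(x_t, t, z) / steps
    return ((x_t - x) ** 2).mean()         # swap in LPIPS etc. in practice

model = DiffusionAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.randn(8, 64)                     # stand-in for image features
for loss_fn in (flow_matching_loss, mode_seeking_loss):  # the two stages
    opt.zero_grad()
    loss_fn(model, x).backward()
    opt.step()
```

In a real system each stage would run for many steps, and the post-training stage would typically retain some of the pre-training objective rather than replace it outright.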
Related papers
- Multi-Scale Invertible Neural Network for Wide-Range Variable-Rate Learned Image Compression [90.59962443790593]
In this paper, we present a variable-rate image compression model based on an invertible transform to overcome the limitations of fixed-rate models.
Specifically, we design a lightweight multi-scale invertible neural network, which maps the input image into multi-scale latent representations.
Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared to existing variable-rate methods.
arXiv Detail & Related papers (2025-03-27T09:08:39Z) - Diffusion Autoencoders are Scalable Image Tokenizers [48.22793874381871]
Tokenizing images into compact visual representations is a key step in learning efficient and high-quality image generative models. We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models.
arXiv Detail & Related papers (2025-01-30T18:59:37Z) - CALLIC: Content Adaptive Learning for Lossless Image Compression [64.47244912937204]
CALLIC sets a new state-of-the-art (SOTA) for learned lossless image compression. We propose a content-aware autoregressive self-attention mechanism that leverages convolutional gating operations. During encoding, we decompose pre-trained layers, including depth-wise convolutions, using low-rank matrices, and then adapt the incremental weights to the test image by Rate-guided Progressive Fine-Tuning (RPFT). RPFT fine-tunes with a gradually increasing number of patches, sorted in descending order by estimated entropy, which optimizes the learning process and reduces adaptation time.
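A minimal sketch of the rate-guided progressive fine-tuning loop as the summary describes it; the entropy proxy, parameter names, and loss interface below are assumptions for illustration, not CALLIC's implementation.

```python
# Hypothetical sketch of RPFT as summarized above: patches are sorted by
# estimated entropy (descending) and the training subset grows each round,
# while only low-rank incremental weights are adapted on the test image.
import torch

def estimated_entropy(patch: torch.Tensor) -> float:
    # Crude histogram proxy; a real codec would use its entropy model's
    # rate estimate for the patch instead.
    hist = torch.histc(patch, bins=16, min=0.0, max=1.0) + 1e-6
    p = hist / hist.sum()
    return float(-(p * p.log2()).sum())

def rpft(patches, lora_params, rate_loss, rounds=3, lr=1e-3):
    """Adapt only the low-rank incremental weights on the test image."""
    order = sorted(patches, key=estimated_entropy, reverse=True)
    opt = torch.optim.Adam(lora_params, lr=lr)
    for r in range(1, rounds + 1):
        subset = order[: max(1, len(order) * r // rounds)]  # grow subset
        for patch in subset:
            opt.zero_grad()
            rate_loss(patch).backward()  # minimize estimated bits per patch
            opt.step()
```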
arXiv Detail & Related papers (2024-12-23T10:41:18Z) - PSC: Posterior Sampling-Based Compression [34.50287066865267]
Posterior Sampling-based Compression (PSC) is a zero-shot compression method that leverages a pre-trained diffusion model as its sole neural network component. PSC constructs a transform that is adaptive to the image. We demonstrate that PSC's performance is comparable to established training-based methods in terms of rate, distortion, and perceptual quality.
arXiv Detail & Related papers (2024-07-13T14:24:22Z) - Transferable Learned Image Compression-Resistant Adversarial Perturbations [66.46470251521947]
Adversarial attacks can readily disrupt image classification systems, revealing the vulnerability of DNN-based recognition tasks.
We introduce a new pipeline that targets image classification models that utilize learned image compressors as pre-processing modules.
arXiv Detail & Related papers (2024-01-06T03:03:28Z) - Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework, which consists of a Masked Quantization VAE (MQ-VAE) and a Stackformer, to relieve the model from modeling redundancy.
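For context, here is a minimal sketch of the stage-one quantization step this paradigm builds on: generic nearest-neighbor vector quantization, not MQ-VAE's masking mechanism.

```python
# Minimal sketch of the codebook (stage-1) step in the two-stage paradigm:
# continuous features are snapped to their nearest codebook entries, and the
# resulting indices are what a stage-2 autoregressive model would predict.
import torch

codebook = torch.randn(512, 16)          # 512 entries of dimension 16
features = torch.randn(4, 16)            # encoder outputs for 4 regions

dists = torch.cdist(features, codebook)  # pairwise L2 distances, (4, 512)
indices = dists.argmin(dim=1)            # one token id per region
quantized = codebook[indices]            # decoder inputs for reconstruction
```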
arXiv Detail & Related papers (2023-05-23T02:15:53Z) - Diffusion Models as Masked Autoencoders [52.442717717898056]
We revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models.
While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input, formulating them as masked autoencoders (DiffMAE).
We perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.
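A rough sketch of what a masked-denoising objective in this spirit could look like; the model interface and noising convention below are assumptions, not DiffMAE's exact formulation.

```python
# Illustrative masked-denoising objective: noise is added only to masked
# patches, the model sees visible patches clean, and the reconstruction loss
# is taken on the masked region. The model signature is hypothetical.
import torch

def masked_denoising_loss(model, patches, mask_ratio=0.75):
    """patches: (batch, num_patches, dim)."""
    b, n, _ = patches.shape
    mask = torch.rand(b, n, 1) < mask_ratio                # True = masked
    t = torch.rand(b, 1, 1)                                # noise level
    noise = torch.randn_like(patches)
    noisy = torch.where(mask, (1 - t) * patches + t * noise, patches)
    pred = model(noisy, t.squeeze(-1), mask)               # predict patches
    return (((pred - patches) ** 2) * mask).sum() / mask.sum()
```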
arXiv Detail & Related papers (2023-04-06T17:59:56Z) - Lossy Image Compression with Conditional Diffusion Models [25.158390422252097]
This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models.
In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model.
Our approach yields stronger reported FID scores than the GAN-based model, while also yielding competitive performance with VAE-based models in several distortion metrics.
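The contrast between a deterministic (mean) decoder and a conditional-diffusion decoder can be sketched as follows; the denoiser interface and the toy sampler are placeholders, not the paper's method.

```python
# Illustrative contrast: a mean decoder produces one point estimate, while a
# conditional-diffusion decoder samples a reconstruction given the
# transmitted latent z. `decoder` and `denoiser` are placeholder callables.
import torch

def deterministic_decode(decoder, z):
    return decoder(z)                    # one forward pass, one output

def diffusion_decode(denoiser, z, shape, steps=50):
    x = torch.randn(shape)               # start from pure noise
    for i in reversed(range(steps)):
        t = torch.full((shape[0],), i / steps)
        eps = denoiser(x, t, z)          # noise prediction given latent z
        x = x - eps / steps              # toy update; a real sampler would
                                         # follow a proper DDPM/DDIM schedule
    return x                             # a sample, not a point estimate
```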
arXiv Detail & Related papers (2022-09-14T21:53:27Z) - Learned Image Compression with Gaussian-Laplacian-Logistic Mixture Model and Concatenated Residual Modules [22.818632387206257]
Two key components of learned image compression are the entropy model of the latent representations and the encoding/decoding network architectures.
We propose a more flexible discretized Gaussian-Laplacian-Logistic mixture model (GLLMM) for the latent representations.
In the encoding/decoding network design, we propose concatenated residual blocks (CRB), in which multiple residual blocks are serially connected with additional shortcut connections.
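The mixture likelihood itself is easy to sketch: the probability of a discretized latent y is the mixture-weighted difference of component CDFs at y ± 0.5. The parameterization below (a single location and scale shared across the three components) is a simplification for illustration, not the paper's full GLLMM.

```python
# Sketch of a discretized Gaussian-Laplacian-Logistic mixture likelihood:
# P(y) = sum_k w_k * (CDF_k(y + 0.5) - CDF_k(y - 0.5)).
import torch

def cdfs(x, mu, b):
    """CDFs of the three components at x (shared mu, b for simplicity)."""
    gauss = 0.5 * (1 + torch.erf((x - mu) / (b * 2 ** 0.5)))
    laplace = torch.where(x < mu,
                          0.5 * torch.exp((x - mu) / b),
                          1 - 0.5 * torch.exp(-(x - mu) / b))
    logistic = torch.sigmoid((x - mu) / b)
    return torch.stack([gauss, laplace, logistic])

def gllmm_likelihood(y, weights, mu, b):
    upper = cdfs(y + 0.5, mu, b)
    lower = cdfs(y - 0.5, mu, b)
    return (weights.view(3, 1) * (upper - lower)).sum(dim=0).clamp_min(1e-9)

y = torch.tensor([0.0, 1.0, -2.0])                  # quantized latents
p = gllmm_likelihood(y, torch.tensor([0.4, 0.3, 0.3]), mu=0.0, b=1.0)
bits = -p.log2().sum()                              # estimated coding rate
```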
arXiv Detail & Related papers (2021-07-14T02:54:22Z) - Lossless Compression with Latent Variable Models [4.289574109162585]
We develop a method for lossless compression with latent variable models, which we call 'bits back with asymmetric numeral systems' (BB-ANS).
The method involves interleaving encode and decode steps, and achieves an optimal rate when compressing batches of data.
We describe 'Craystack', a modular software framework which we have developed for rapid prototyping of compression using deep generative models.
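The rate accounting behind bits-back coding can be sketched directly: the net rate approaches the negative ELBO, because the -log q(z|x) bits spent sampling z are recovered by the receiver. The toy Gaussian setup below is an illustration of that accounting, not Craystack's API.

```python
# Sketch of the bits-back rate: net bits = -log p(x|z) - log p(z) + log q(z|x),
# whose expectation is the negative ELBO. Distributions here are toy Gaussians;
# `log_px_given_z` is a hypothetical likelihood callable returning nats.
import torch

def bits_back_rate(x, mu_q, sigma_q, log_px_given_z):
    q = torch.distributions.Normal(mu_q, sigma_q)   # encoder posterior q(z|x)
    prior = torch.distributions.Normal(0.0, 1.0)    # latent prior p(z)
    z = q.sample()
    nats = (-log_px_given_z(x, z)                   # pay to code x given z
            - prior.log_prob(z).sum()               # pay to code z under p(z)
            + q.log_prob(z).sum())                  # get bits back via q(z|x)
    return nats / torch.log(torch.tensor(2.0))      # convert nats to bits
```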
arXiv Detail & Related papers (2021-04-21T14:03:05Z) - Learning to Learn to Compress [25.23586503813838]
We present an end-to-end meta-learned system for image compression.
We propose a new training paradigm for learned image compression based on meta-learning.
arXiv Detail & Related papers (2020-07-31T13:13:53Z)