Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization
- URL: http://arxiv.org/abs/2503.11056v2
- Date: Wed, 02 Apr 2025 18:40:41 GMT
- Title: Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization
- Authors: Kyle Sargent, Kyle Hsu, Justin Johnson, Li Fei-Fei, Jiajun Wu
- Abstract summary: FlowMo is a transformer-based diffusion autoencoder that achieves a new state-of-the-art for image tokenization at multiple compression rates. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage.
- Score: 28.089274647643716
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Since the advent of popular visual generation frameworks like VQGAN and latent diffusion models, state-of-the-art image generation systems have generally been two-stage systems that first tokenize or compress visual data into a lower-dimensional latent space before learning a generative model. Tokenizer training typically follows a standard recipe in which images are compressed and reconstructed subject to a combination of MSE, perceptual, and adversarial losses. Diffusion autoencoders have been proposed in prior work as a way to learn end-to-end perceptually-oriented image compression, but have not yet shown state-of-the-art performance on the competitive task of ImageNet-1K reconstruction. We propose FlowMo, a transformer-based diffusion autoencoder that achieves a new state-of-the-art for image tokenization at multiple compression rates without using convolutions, adversarial losses, spatially-aligned two-dimensional latent codes, or distilling from other tokenizers. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage. In addition, we conduct extensive analyses and explore the training of generative models atop the FlowMo tokenizer. Our code and models will be available at http://kylesargent.github.io/flowmo.
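The two-stage recipe in the abstract lends itself to a compact illustration. The sketch below is a hypothetical, heavily simplified rendering of that structure in PyTorch: a rectified-flow reconstruction loss for mode-matching pre-training, then a few-step sampling loss for mode-seeking post-training. All module names, shapes, and losses here are assumptions for illustration, not FlowMo's actual architecture.

```python
# Hypothetical sketch of a two-stage diffusion-autoencoder recipe: a
# mode-matching (flow-matching) pre-training stage, then a mode-seeking
# post-training stage. Names and shapes are illustrative, not FlowMo's API.
import torch
import torch.nn as nn

class DiffusionAutoencoder(nn.Module):
    def __init__(self, dim=64, latent_dim=16):
        super().__init__()
        self.encoder = nn.Linear(dim, latent_dim)
        # Velocity network: takes noisy input, time, and latent code.
        self.velocity = nn.Linear(dim + latent_dim + 1, dim)

    def forward(self, x_t, t, z):
        return self.velocity(torch.cat([x_t, z, t], dim=-1))

def flow_matching_loss(model, x):
    """Stage 1 (mode-matching): regress the rectified-flow velocity."""
    z = model.encoder(x)
    t = torch.rand(x.shape[0], 1)
    noise = torch.randn_like(x)
    x_t = (1 - t) * noise + t * x          # linear interpolation path
    target_v = x - noise                   # rectified-flow target velocity
    return ((model(x_t, t, z) - target_v) ** 2).mean()

def mode_seeking_loss(model, x, steps=4):
    """Stage 2 (mode-seeking): push few-step samples toward sharp
    reconstructions, e.g. via a perceptual distance (plain MSE here)."""
    z = model.encoder(x)
    x_t = torch.randn_like(x)
    for i in range(steps):                 # cheap unrolled Euler sampler
        t = torch.full((x.shape[0], 1), i / steps)
        x_t = x_t + model(x_t, t, z) / steps
    return ((x_t - x) ** 2).mean()         # swap in LPIPS etc. in practice

model = DiffusionAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.randn(8, 64)                     # stand-in for image features
for loss_fn in (flow_matching_loss, mode_seeking_loss):  # the two stages
    opt.zero_grad()
    loss_fn(model, x).backward()
    opt.step()
```

In a real system each stage would run for many steps, and the post-training stage would typically retain some of the pre-training objective rather than replace it outright.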
Related papers
- Multi-Scale Invertible Neural Network for Wide-Range Variable-Rate Learned Image Compression [90.59962443790593]
In this paper, we present a variable-rate image compression model based on an invertible transform to overcome the limitations of fixed-rate models.
Specifically, we design a lightweight multi-scale invertible neural network, which maps the input image into multi-scale latent representations.
Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared to existing variable-rate methods.
arXiv Detail & Related papers (2025-03-27T09:08:39Z) - Diffusion Autoencoders are Scalable Image Tokenizers [48.22793874381871]
Tokenizing images into compact visual representations is a key step in learning efficient and high-quality image generative models. We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models.
arXiv Detail & Related papers (2025-01-30T18:59:37Z) - CALLIC: Content Adaptive Learning for Lossless Image Compression [64.47244912937204]
CALLIC sets a new state-of-the-art (SOTA) for learned lossless image compression. We propose a content-aware autoregressive self-attention mechanism that leverages convolutional gating operations. During encoding, we decompose pre-trained layers, including depth-wise convolutions, using low-rank matrices, and then adapt the incremental weights to the test image by Rate-guided Progressive Fine-Tuning (RPFT). RPFT fine-tunes with a gradually increasing number of patches, sorted in descending order by estimated entropy, which optimizes the learning process and reduces adaptation time.
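A minimal sketch of the rate-guided progressive fine-tuning loop as the summary describes it; the entropy proxy, parameter names, and loss interface below are assumptions for illustration, not CALLIC's implementation.

```python
# Hypothetical sketch of RPFT as summarized above: patches are sorted by
# estimated entropy (descending) and the training subset grows each round,
# while only low-rank incremental weights are adapted on the test image.
import torch

def estimated_entropy(patch: torch.Tensor) -> float:
    # Crude histogram proxy; a real codec would use its entropy model's
    # rate estimate for the patch instead.
    hist = torch.histc(patch, bins=16, min=0.0, max=1.0) + 1e-6
    p = hist / hist.sum()
    return float(-(p * p.log2()).sum())

def rpft(patches, lora_params, rate_loss, rounds=3, lr=1e-3):
    """Adapt only the low-rank incremental weights on the test image."""
    order = sorted(patches, key=estimated_entropy, reverse=True)
    opt = torch.optim.Adam(lora_params, lr=lr)
    for r in range(1, rounds + 1):
        subset = order[: max(1, len(order) * r // rounds)]  # grow subset
        for patch in subset:
            opt.zero_grad()
            rate_loss(patch).backward()  # minimize estimated bits per patch
            opt.step()
```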
arXiv Detail & Related papers (2024-12-23T10:41:18Z) - PSC: Posterior Sampling-Based Compression [34.50287066865267]
Posterior Sampling-based Compression (PSC) is a zero-shot compression method that leverages a pre-trained diffusion model as its sole neural network component. PSC constructs a transform that is adaptive to the image. We demonstrate that PSC's performance is comparable to established training-based methods in terms of rate, distortion, and perceptual quality.
arXiv Detail & Related papers (2024-07-13T14:24:22Z) - Transferable Learned Image Compression-Resistant Adversarial Perturbations [66.46470251521947]
Adversarial attacks can readily disrupt image classification systems, revealing the vulnerability of DNN-based recognition tasks.
We introduce a new pipeline that targets image classification models that utilize learned image compressors as pre-processing modules.
arXiv Detail & Related papers (2024-01-06T03:03:28Z) - Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework, which consists of a Masked Quantization VAE (MQ-VAE) and a Stackformer, to relieve the model from modeling redundancy.
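For context, here is a minimal sketch of the stage-one quantization step this paradigm builds on: generic nearest-neighbor vector quantization, not MQ-VAE's masking mechanism.

```python
# Minimal sketch of the codebook (stage-1) step in the two-stage paradigm:
# continuous features are snapped to their nearest codebook entries, and the
# resulting indices are what a stage-2 autoregressive model would predict.
import torch

codebook = torch.randn(512, 16)          # 512 entries of dimension 16
features = torch.randn(4, 16)            # encoder outputs for 4 regions

dists = torch.cdist(features, codebook)  # pairwise L2 distances, (4, 512)
indices = dists.argmin(dim=1)            # one token id per region
quantized = codebook[indices]            # decoder inputs for reconstruction
```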
arXiv Detail & Related papers (2023-05-23T02:15:53Z) - Diffusion Models as Masked Autoencoders [52.442717717898056]
We revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models.
While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input, formulating them as masked autoencoders (DiffMAE).
We perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.
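A rough sketch of what a masked-denoising objective in this spirit could look like; the model interface and noising convention below are assumptions, not DiffMAE's exact formulation.

```python
# Illustrative masked-denoising objective: noise is added only to masked
# patches, the model sees visible patches clean, and the reconstruction loss
# is taken on the masked region. The model signature is hypothetical.
import torch

def masked_denoising_loss(model, patches, mask_ratio=0.75):
    """patches: (batch, num_patches, dim)."""
    b, n, _ = patches.shape
    mask = torch.rand(b, n, 1) < mask_ratio                # True = masked
    t = torch.rand(b, 1, 1)                                # noise level
    noise = torch.randn_like(patches)
    noisy = torch.where(mask, (1 - t) * patches + t * noise, patches)
    pred = model(noisy, t.squeeze(-1), mask)               # predict patches
    return (((pred - patches) ** 2) * mask).sum() / mask.sum()
```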
arXiv Detail & Related papers (2023-04-06T17:59:56Z) - Lossy Image Compression with Conditional Diffusion Models [25.158390422252097]
This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models.
In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model.
Our approach yields stronger reported FID scores than the GAN-based model, while also yielding competitive performance with VAE-based models in several distortion metrics.
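The contrast between a deterministic (mean) decoder and a conditional-diffusion decoder can be sketched as follows; the denoiser interface and the toy sampler are placeholders, not the paper's method.

```python
# Illustrative contrast: a mean decoder produces one point estimate, while a
# conditional-diffusion decoder samples a reconstruction given the
# transmitted latent z. `decoder` and `denoiser` are placeholder callables.
import torch

def deterministic_decode(decoder, z):
    return decoder(z)                    # one forward pass, one output

def diffusion_decode(denoiser, z, shape, steps=50):
    x = torch.randn(shape)               # start from pure noise
    for i in reversed(range(steps)):
        t = torch.full((shape[0],), i / steps)
        eps = denoiser(x, t, z)          # noise prediction given latent z
        x = x - eps / steps              # toy update; a real sampler would
                                         # follow a proper DDPM/DDIM schedule
    return x                             # a sample, not a point estimate
```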
arXiv Detail & Related papers (2022-09-14T21:53:27Z) - Learned Image Compression with Gaussian-Laplacian-Logistic Mixture Model and Concatenated Residual Modules [22.818632387206257]
Two key components of learned image compression are the entropy model of the latent representations and the encoding/decoding network architectures.
We propose a more flexible discretized Gaussian-Laplacian-Logistic mixture model (GLLMM) for the latent representations.
In the encoding/decoding network design, we propose concatenated residual blocks (CRB), in which multiple residual blocks are serially connected with additional shortcut connections.
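The mixture likelihood itself is easy to sketch: the probability of a discretized latent y is the mixture-weighted difference of component CDFs at y ± 0.5. The parameterization below (a single location and scale shared across the three components) is a simplification for illustration, not the paper's full GLLMM.

```python
# Sketch of a discretized Gaussian-Laplacian-Logistic mixture likelihood:
# P(y) = sum_k w_k * (CDF_k(y + 0.5) - CDF_k(y - 0.5)).
import torch

def cdfs(x, mu, b):
    """CDFs of the three components at x (shared mu, b for simplicity)."""
    gauss = 0.5 * (1 + torch.erf((x - mu) / (b * 2 ** 0.5)))
    laplace = torch.where(x < mu,
                          0.5 * torch.exp((x - mu) / b),
                          1 - 0.5 * torch.exp(-(x - mu) / b))
    logistic = torch.sigmoid((x - mu) / b)
    return torch.stack([gauss, laplace, logistic])

def gllmm_likelihood(y, weights, mu, b):
    upper = cdfs(y + 0.5, mu, b)
    lower = cdfs(y - 0.5, mu, b)
    return (weights.view(3, 1) * (upper - lower)).sum(dim=0).clamp_min(1e-9)

y = torch.tensor([0.0, 1.0, -2.0])                  # quantized latents
p = gllmm_likelihood(y, torch.tensor([0.4, 0.3, 0.3]), mu=0.0, b=1.0)
bits = -p.log2().sum()                              # estimated coding rate
```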
arXiv Detail & Related papers (2021-07-14T02:54:22Z) - Lossless Compression with Latent Variable Models [4.289574109162585]
We develop a method for lossless compression with latent variable models, which we call 'bits back with asymmetric numeral systems' (BB-ANS).
The method involves interleaving encode and decode steps, and achieves an optimal rate when compressing batches of data.
We describe 'Craystack', a modular software framework which we have developed for rapid prototyping of compression using deep generative models.
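The rate accounting behind bits-back coding can be sketched directly: the net rate approaches the negative ELBO, because the -log q(z|x) bits spent sampling z are recovered by the receiver. The toy Gaussian setup below is an illustration of that accounting, not Craystack's API.

```python
# Sketch of the bits-back rate: net bits = -log p(x|z) - log p(z) + log q(z|x),
# whose expectation is the negative ELBO. Distributions here are toy Gaussians;
# `log_px_given_z` is a hypothetical likelihood callable returning nats.
import torch

def bits_back_rate(x, mu_q, sigma_q, log_px_given_z):
    q = torch.distributions.Normal(mu_q, sigma_q)   # encoder posterior q(z|x)
    prior = torch.distributions.Normal(0.0, 1.0)    # latent prior p(z)
    z = q.sample()
    nats = (-log_px_given_z(x, z)                   # pay to code x given z
            - prior.log_prob(z).sum()               # pay to code z under p(z)
            + q.log_prob(z).sum())                  # get bits back via q(z|x)
    return nats / torch.log(torch.tensor(2.0))      # convert nats to bits
```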
arXiv Detail & Related papers (2021-04-21T14:03:05Z) - Learning to Learn to Compress [25.23586503813838]
We present an end-to-end meta-learned system for image compression.
We propose a new training paradigm for learned image compression based on meta-learning.
arXiv Detail & Related papers (2020-07-31T13:13:53Z)