Diffusion Autoencoders are Scalable Image Tokenizers
- URL: http://arxiv.org/abs/2501.18593v1
- Date: Thu, 30 Jan 2025 18:59:37 GMT
- Title: Diffusion Autoencoders are Scalable Image Tokenizers
- Authors: Yinbo Chen, Rohit Girdhar, Xiaolong Wang, Sai Saketh Rambhatla, Ishan Misra,
- Abstract summary: Tokenizing images into compact visual representations is a key step in learning efficient and high-quality image generative models.
We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models.
- Score: 48.22793874381871
- License:
- Abstract: Tokenizing images into compact visual representations is a key step in learning efficient and high-quality image generative models. We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models. Our key insight is that a single learning objective, diffusion L2 loss, can be used for training scalable image tokenizers. Since diffusion is already widely used for image generation, our insight greatly simplifies training such tokenizers. In contrast, current state-of-the-art tokenizers rely on an empirically found combination of heuristics and losses, thus requiring a complex training recipe that relies on non-trivially balancing different losses and pretrained supervised models. We show design decisions, along with theoretical grounding, that enable us to scale DiTo for learning competitive image representations. Our results show that DiTo is a simpler, scalable, and self-supervised alternative to the current state-of-the-art image tokenizer which is supervised. DiTo achieves competitive or better quality than state-of-the-art in image reconstruction and downstream image generation tasks.
Related papers
- Gen-SIS: Generative Self-augmentation Improves Self-supervised Learning [52.170253590364545]
Gen-SIS is a diffusion-based augmentation technique trained exclusively on unlabeled image data.
We show that these self-augmentations', i.e. generative augmentations based on the vanilla SSL encoder embeddings, facilitate the training of a stronger SSL encoder.
arXiv Detail & Related papers (2024-12-02T16:20:59Z) - Factorized Visual Tokenization and Generation [37.56136469262736]
We introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks.
This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization.
Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-11-25T18:59:53Z) - Diffusion-based image inpainting with internal learning [4.912318087940015]
We propose lightweight diffusion models for image inpainting that can be trained on a single image, or a few images.
We show that our approach competes with large state-of-the-art models in specific cases.
arXiv Detail & Related papers (2024-06-06T16:04:06Z) - Denoising Autoregressive Representation Learning [13.185567468951628]
Our method, DARL, employs a decoder-only Transformer to predict image patches autoregressively.
We show that the learned representation can be improved by using tailored noise schedules and longer training in larger models.
arXiv Detail & Related papers (2024-03-08T10:19:00Z) - Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS)
A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
arXiv Detail & Related papers (2024-02-28T06:07:07Z) - Transformer-based Clipped Contrastive Quantization Learning for
Unsupervised Image Retrieval [15.982022297570108]
Unsupervised image retrieval aims to learn the important visual characteristics without any given level to retrieve the similar images for a given query image.
In this paper, we propose a TransClippedCLR model by encoding the global context of an image using Transformer having local context through patch based processing.
Results using the proposed clipped contrastive learning are greatly improved on all datasets as compared to same backbone network with vanilla contrastive learning.
arXiv Detail & Related papers (2024-01-27T09:39:11Z) - Aligning Text-to-Image Diffusion Models with Reward Backpropagation [62.45086888512723]
We propose AlignProp, a method that aligns diffusion models to downstream reward functions using end-to-end backpropagation of the reward gradient.
We show AlignProp achieves higher rewards in fewer training steps than alternatives, while being conceptually simpler.
arXiv Detail & Related papers (2023-10-05T17:59:18Z) - Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z) - Counterfactual Generative Networks [59.080843365828756]
We propose to decompose the image generation process into independent causal mechanisms that we train without direct supervision.
By exploiting appropriate inductive biases, these mechanisms disentangle object shape, object texture, and background.
We show that the counterfactual images can improve out-of-distribution with a marginal drop in performance on the original classification task.
arXiv Detail & Related papers (2021-01-15T10:23:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.