CODA: Repurposing Continuous VAEs for Discrete Tokenization
- URL: http://arxiv.org/abs/2503.17760v1
- Date: Sat, 22 Mar 2025 12:59:00 GMT
- Title: CODA: Repurposing Continuous VAEs for Discrete Tokenization
- Authors: Zeyu Liu, Zanlin Ni, Yeguo Hua, Xin Deng, Xiao Ma, Cheng Zhong, Gao Huang
- Abstract summary: \textbf{CODA} (\textbf{CO}ntinuous-to-\textbf{D}iscrete \textbf{A}daptation) is a framework that decouples compression and discretization. Our approach achieves a remarkable codebook utilization of 100% and a notable reconstruction FID (rFID) of $\mathbf{0.43}$ and $\mathbf{1.34}$ for $8\times$ and $16\times$ compression on the ImageNet 256$\times$256 benchmark.
- Score: 52.58960429582813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Discrete visual tokenizers transform images into a sequence of tokens, enabling token-based visual generation akin to language models. However, this process is inherently challenging, as it requires both compressing visual signals into a compact representation and discretizing them into a fixed set of codes. Traditional discrete tokenizers typically learn the two tasks jointly, often leading to unstable training, low codebook utilization, and limited reconstruction quality. In this paper, we introduce \textbf{CODA} (\textbf{CO}ntinuous-to-\textbf{D}iscrete \textbf{A}daptation), a framework that decouples compression and discretization. Instead of training discrete tokenizers from scratch, CODA adapts off-the-shelf continuous VAEs -- already optimized for perceptual compression -- into discrete tokenizers via a carefully designed discretization process. By focusing primarily on discretization, CODA ensures stable and efficient training while retaining the strong visual fidelity of continuous VAEs. Empirically, with $\mathbf{6 \times}$ less training budget than standard VQGAN, our approach achieves a remarkable codebook utilization of 100% and a notable reconstruction FID (rFID) of $\mathbf{0.43}$ and $\mathbf{1.34}$ for $8\times$ and $16\times$ compression on the ImageNet 256$\times$256 benchmark.
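The decoupling described in the abstract can be illustrated with a short sketch: a frozen, off-the-shelf continuous VAE handles compression, and only a quantization module over its latents is trained. The module below is a generic nearest-neighbor quantizer with a straight-through estimator, not CODA's actual discretization design; all names and sizes are illustrative.

```python
# Minimal sketch of continuous-to-discrete adaptation: freeze a pretrained
# continuous VAE and train only a quantizer over its latents. PretrainedVAE,
# the codebook size, and dims are illustrative stand-ins, not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentQuantizer(nn.Module):
    def __init__(self, num_codes=8192, dim=16):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                      # z: (B, C, H, W) continuous latents
        flat = z.permute(0, 2, 3, 1).reshape(-1, z.shape[1])
        dists = torch.cdist(flat, self.codebook.weight)       # (BHW, K)
        idx = dists.argmin(dim=1)
        zq = self.codebook(idx).view(z.shape[0], z.shape[2], z.shape[3], -1)
        zq = zq.permute(0, 3, 1, 2)
        # straight-through estimator: gradients bypass the argmin
        zq = z + (zq - z).detach()
        commit = F.mse_loss(z, zq.detach())    # commitment loss on the encoder side
        return zq, idx, commit

# Hypothetical usage with a frozen off-the-shelf VAE:
# vae = PretrainedVAE()
# for p in vae.parameters(): p.requires_grad_(False)
# quant = LatentQuantizer()
# z = vae.encode(images); zq, idx, commit = quant(z); recon = vae.decode(zq)
```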
Related papers
- Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation [63.89280381800457]
We propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. We introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism. Our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction.
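A minimal sketch of what dimension-wise quantization could look like: each feature dimension is discretized independently into a small set of uniform levels, so a D-dimensional token becomes D categorical indices. The bin count and value range below are assumptions for illustration, not TokenBridge's actual settings.

```python
# Hedged sketch of dimension-wise quantization: every feature dimension is
# independently rounded to one of `num_levels` uniform values.
import torch

def quantize_per_dim(z, num_levels=16, lo=-1.0, hi=1.0):
    """z: (..., D) continuous tokens -> integer indices per dimension."""
    z = z.clamp(lo, hi)
    step = (hi - lo) / (num_levels - 1)
    return torch.round((z - lo) / step).long()     # (..., D) in [0, num_levels)

def dequantize_per_dim(idx, num_levels=16, lo=-1.0, hi=1.0):
    """Map per-dimension indices back to continuous values."""
    step = (hi - lo) / (num_levels - 1)
    return lo + idx.float() * step
```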
arXiv Detail & Related papers (2025-03-20T17:59:59Z)
- UniTok: A Unified Tokenizer for Visual Generation and Understanding [69.09699034036124]
We introduce UniTok, a discrete visual tokenizer that encodes fine-grained details for generation while also capturing high-level semantics for understanding. Our method significantly raises the upper limit of unified discrete tokenizers to match or even surpass domain-specific continuous tokenizers.
arXiv Detail & Related papers (2025-02-27T17:47:01Z)
- GaussianToken: An Effective Image Tokenizer with 2D Gaussian Splatting [64.84383010238908]
We propose an effective image tokenizer with 2D Gaussian Splatting as a solution. In general, our framework integrates the local influence of the 2D Gaussian distribution into the discrete space. Competitive reconstruction performance on CIFAR, Mini-ImageNet, and ImageNet-1K demonstrates the effectiveness of our framework.
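For intuition about the "local influence" the summary mentions, the sketch below renders a single 2D Gaussian splat onto an image canvas; GaussianToken's actual tokenization and quantization pipeline is not reproduced here, and all parameters are illustrative.

```python
# Illustrative only: the local contribution of one 2D Gaussian "splat" to an
# image, i.e. a Mahalanobis-weighted color kernel around its center.
import torch

def splat_gaussian(H, W, mu, inv_cov, color, opacity):
    """Render one 2D Gaussian onto an (H, W, 3) canvas."""
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    d = torch.stack([xs - mu[0], ys - mu[1]], dim=-1)      # (H, W, 2) offsets
    m = torch.einsum("hwi,ij,hwj->hw", d, inv_cov, d)      # Mahalanobis distance
    w = opacity * torch.exp(-0.5 * m)                      # (H, W) local weights
    return w.unsqueeze(-1) * color                         # (H, W, 3) contribution
```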
arXiv Detail & Related papers (2025-01-26T17:56:11Z)
- SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization [20.109136454526233]
We propose SweetTok, a novel video tokenizer that overcomes the limitations of current video tokenization methods. SweetTok compresses visual inputs through distinct spatial and temporal queries via a \textbf{D}ecoupled \textbf{Q}uery \textbf{A}uto\textbf{E}ncoder (DQAE). We show that SweetTok significantly improves video reconstruction results by \textbf{42.8\%} w.r.t. rFVD on the UCF-101 dataset.
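A rough sketch of decoupled spatial and temporal queries, assuming a cross-attention readout in the spirit of the DQAE: learnable spatial queries summarize each frame, and learnable temporal queries summarize the frame sequence. The module layout and dimensions are guesses for illustration, not SweetTok's implementation.

```python
# Hedged sketch: spatial queries attend within frames, temporal queries
# attend across per-frame summaries.
import torch
import torch.nn as nn

class DecoupledQueryEncoder(nn.Module):
    def __init__(self, dim=256, n_spatial=64, n_temporal=16, heads=8):
        super().__init__()
        self.sq = nn.Parameter(torch.randn(n_spatial, dim))    # spatial queries
        self.tq = nn.Parameter(torch.randn(n_temporal, dim))   # temporal queries
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                  # feats: (B, T, N, D) patch features
        B, T, N, D = feats.shape
        per_frame = feats.reshape(B * T, N, D)
        sq = self.sq.unsqueeze(0).expand(B * T, -1, -1)
        s_tok, _ = self.spatial_attn(sq, per_frame, per_frame)  # (B*T, Ns, D)
        # pool spatial tokens per frame, then read out temporal tokens
        frame_summary = s_tok.mean(dim=1).view(B, T, D)
        tq = self.tq.unsqueeze(0).expand(B, -1, -1)
        t_tok, _ = self.temporal_attn(tq, frame_summary, frame_summary)
        return s_tok.view(B, T, -1, D), t_tok
```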
arXiv Detail & Related papers (2024-12-11T13:48:06Z)
- Scalable Image Tokenization with Index Backpropagation Quantization [74.15447383432262]
Index Backpropagation Quantization (IBQ) is a new VQ method for the joint optimization of all codebook embeddings and the visual encoder. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook with high dimension ($256$) and high utilization.
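The joint-optimization claim can be made concrete with a hedged sketch: pass gradients through a straight-through one-hot over all codebook entries, so every embedding receives gradient rather than only the nearest code as in classic VQ. The dot-product logits and sizes are illustrative choices, not necessarily IBQ's exact formulation.

```python
# Sketch of index backpropagation: a straight-through one-hot over ALL
# codebook entries, so gradients reach every embedding via the softmax.
import torch
import torch.nn.functional as F

def ibq_quantize(z, codebook):
    """z: (N, D) latents; codebook: (K, D). Returns quantized latents."""
    logits = z @ codebook.t()                      # (N, K) similarity logits
    soft = F.softmax(logits, dim=-1)               # gradients reach all K codes
    hard = F.one_hot(soft.argmax(dim=-1), codebook.shape[0]).float()
    onehot = hard + soft - soft.detach()           # straight-through one-hot
    return onehot @ codebook                       # (N, D) quantized output
```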
arXiv Detail & Related papers (2024-12-03T18:59:10Z)
- Factorized Visual Tokenization and Generation [37.56136469262736]
We introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks.
This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization.
Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance.
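A minimal sketch of the factorization idea, under the assumption of a product-quantization-style split: each latent is chunked, and every chunk is matched against its own small sub-codebook, so an effective codebook of $K^G$ entries costs only $G$ nearest-neighbor searches over $K$ entries each. Sizes are illustrative.

```python
# Hedged sketch of codebook factorization: quantize latent chunks against
# independent sub-codebooks (requires D divisible by the number of groups).
import torch

def factorized_quantize(z, sub_codebooks):
    """z: (N, D); sub_codebooks: list of G tensors, each (K, D // G)."""
    chunks = z.chunk(len(sub_codebooks), dim=-1)
    out, indices = [], []
    for c, cb in zip(chunks, sub_codebooks):
        d = torch.cdist(c, cb)                 # (N, K) distances per chunk
        idx = d.argmin(dim=-1)
        out.append(cb[idx])                    # (N, D // G) quantized chunk
        indices.append(idx)
    return torch.cat(out, dim=-1), torch.stack(indices, dim=-1)
```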
arXiv Detail & Related papers (2024-11-25T18:59:53Z)
- Continuous Speculative Decoding for Autoregressive Image Generation [33.05392461723613]
Continuous-valued Autoregressive (AR) image generation models have demonstrated notable superiority over their discrete-token counterparts.
Meanwhile, speculative decoding has proven effective in accelerating Large Language Models (LLMs).
This work generalizes the speculative decoding algorithm from discrete tokens to continuous space.
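The continuous generalization hinges on an acceptance test over densities rather than token probabilities: a draft sample x ~ q is kept with probability min(1, p(x)/q(x)). The sketch below shows only that test, with stand-in log-density callables; on rejection, the full algorithm resamples from a corrected residual distribution, which is omitted here.

```python
# Hedged sketch of the continuous acceptance test in speculative decoding.
import math
import random

def accept_draft(x, log_p, log_q):
    """Accept a draft sample x ~ q with probability min(1, p(x)/q(x)).

    log_p, log_q: callables returning target/draft log-densities at x.
    """
    ratio = math.exp(min(0.0, log_p(x) - log_q(x)))   # min(1, p/q), stable in log space
    return random.random() < ratio
```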
arXiv Detail & Related papers (2024-11-18T09:19:15Z)
- LoCoCo: Dropping In Convolutions for Long Context Compression [77.26610232994508]
This paper presents a novel approach, Dropping In Convolutions for Long Context Compression (LoCoCo).
LoCoCo employs only a fixed-size Key-Value (KV) cache and can enhance efficiency in both the inference and fine-tuning stages.
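One way to read "fixed-size KV cache" is sketched below: when the cache exceeds its budget, older entries are fused into fewer slots by a strided 1-D convolution over the sequence axis. The fusion operator and budget policy here are stand-ins for illustration, not LoCoCo's trained components.

```python
# Hedged sketch: keep a recency window intact and shrink older cache entries
# with a strided Conv1d so the total cache length stays bounded.
import torch
import torch.nn as nn

class FixedKVCache(nn.Module):
    def __init__(self, dim=64, budget=128, stride=2):
        super().__init__()
        self.budget = budget
        self.fuse = nn.Conv1d(dim, dim, kernel_size=stride, stride=stride)

    def append(self, cache, kv):               # cache, kv: (B, T, D)
        cache = torch.cat([cache, kv], dim=1)
        if cache.shape[1] > self.budget:
            keep = self.budget // 2            # recent half kept verbatim
            old, recent = cache[:, :-keep], cache[:, -keep:]
            old = self.fuse(old.transpose(1, 2)).transpose(1, 2)  # fuse old slots
            cache = torch.cat([old, recent], dim=1)
        return cache
```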
arXiv Detail & Related papers (2024-06-08T01:35:11Z)
- SC-VAE: Sparse Coding-based Variational Autoencoder with Learned ISTA [0.6770292596301478]
We introduce a new VAE variant, termed sparse coding-based VAE with learned ISTA (SC-VAE), which integrates sparse coding within the variational autoencoder framework.
Experiments on two image datasets demonstrate that our model achieves improved image reconstruction results compared to state-of-the-art methods.
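For reference, a minimal sketch of the learned-ISTA (LISTA) component the title refers to: a few unrolled iterations of soft-thresholded updates with learned weight matrices produce a sparse code. Sizes, iteration count, and threshold are illustrative, not SC-VAE's configuration.

```python
# Hedged sketch of LISTA unrolling: z_{k+1} = soft(We @ x + S @ z_k, theta).
import torch
import torch.nn as nn

class LISTA(nn.Module):
    def __init__(self, in_dim=256, code_dim=512, n_iters=3, theta=0.1):
        super().__init__()
        self.We = nn.Linear(in_dim, code_dim, bias=False)   # learned encoder
        self.S = nn.Linear(code_dim, code_dim, bias=False)  # learned mutual-inhibition
        self.theta = nn.Parameter(torch.tensor(theta))      # learned threshold
        self.n_iters = n_iters

    def soft_threshold(self, v):
        return torch.sign(v) * torch.clamp(v.abs() - self.theta, min=0.0)

    def forward(self, x):
        b = self.We(x)
        z = torch.zeros_like(b)
        for _ in range(self.n_iters):
            z = self.soft_threshold(b + self.S(z))
        return z                                            # sparse code
```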
arXiv Detail & Related papers (2023-03-29T13:18:33Z)