Related papers: Taming Scalable Visual Tokenizer for Autoregressive Image Generation

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

URL: http://arxiv.org/abs/2412.02692v1
Date: Tue, 03 Dec 2024 18:59:10 GMT
Title: Taming Scalable Visual Tokenizer for Autoregressive Image Generation
Authors: Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, Limin Wang,
Abstract summary: Index Backpropagation Quantization (IBQ) is a new VQ method for the joint optimization of all codebook embeddings and the visual encoder.<n>IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook with high dimension ($256$) and high utilization.
Score: 74.15447383432262
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing vector quantization (VQ) methods struggle with scalability, largely attributed to the instability of the codebook that undergoes partial updates during training. The codebook is prone to collapse as utilization decreases, due to the progressively widening distribution gap between non-activated codes and visual features. To solve the problem, we propose Index Backpropagation Quantization (IBQ), a new VQ method for the joint optimization of all codebook embeddings and the visual encoder. Applying a straight-through estimator on the one-hot categorical distribution between the encoded feature and codebook, all codes are differentiable and maintain a consistent latent space with the visual encoder. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook ($2^{18}$) with high dimension ($256$) and high utilization. Experiments on the standard ImageNet benchmark demonstrate the scalability and superiority of IBQ, achieving competitive results on both reconstruction ($1.00$ rFID) and autoregressive visual generation ($2.05$ gFID). The code and models are available at https://github.com/TencentARC/SEED-Voken.

Related papers

CODA: Repurposing Continuous VAEs for Discrete Tokenization [52.58960429582813]
textbfCODA(textbfCOntinuous-to-textbfDiscrete textbfAdaptation) is a framework that decouples compression and discretization. Our approach achieves a remarkable codebook utilization of 100% and notable reconstruction FID (rFID) of $mathbf0.43$ and $mathbf1.34$ for $8 times$ and $16 times$ compression on ImageNet 256$times$ 256 benchmark.
arXiv Detail & Related papers (2025-03-22T12:59:00Z)
Dual Codebook VQ: Enhanced Image Reconstruction with Reduced Codebook Size [0.0]
Vector Quantization (VQ) techniques face challenges in codebook utilization, limiting reconstruction fidelity in image modeling. We introduce a Dual Codebook mechanism that effectively addresses this limitation by partitioning the representation into complementary global and local components. Our approach achieves significant FID improvements across diverse image domains, particularly excelling in scene and face reconstruction tasks.
arXiv Detail & Related papers (2025-03-13T19:31:18Z)
SSVQ: Unleashing the Potential of Vector Quantization with Sign-Splitting [7.663974702092357]
We introduce a novel VQ paradigm, Sign-Splitting VQ (SSVQ), which decouples the sign bit of weights from the codebook.<n>We show that SSVQ achieves a significantly superior compression-accuracy trade-off compared to conventional VQ.
arXiv Detail & Related papers (2025-03-11T17:52:48Z)
Factorized Visual Tokenization and Generation [37.56136469262736]
We introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks.<n>This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization.<n> Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-11-25T18:59:53Z)
Image Understanding Makes for A Good Tokenizer for Image Generation [62.875788091204626]
We introduce a token-based IG framework, which relies on effective tokenizers to project images into token sequences. We show that tokenizers with strong IU capabilities achieve superior IG performance across a variety of metrics, datasets, tasks, and proposal networks.
arXiv Detail & Related papers (2024-11-07T03:55:23Z)
Addressing Representation Collapse in Vector Quantized Models with One Linear Layer [10.532262196027752]
Vector Quantization (VQ) is a widely used method for converting continuous representations into discrete codes. VQ models are often hindered by the problem of representation collapse in the latent space. We propose textbfSimVQ, a novel method which re parameterizes the code vectors through a linear transformation layer based on a learnable latent basis.
arXiv Detail & Related papers (2024-11-04T12:40:18Z)
HyperVQ: MLR-based Vector Quantization in Hyperbolic Space [56.4245885674567]
A common solution is to employ Vector Quantization (VQ) within VQ Variational Autoencoders (VQVAEs) We introduce HyperVQ, a novel approach that formulates VQ as a hyperbolic Multinomial Logistic Regression (MLR) problem. Our experiments demonstrate that HyperVQ matches traditional VQ in generative and reconstruction tasks, while surpassing it in discriminative performance.
arXiv Detail & Related papers (2024-03-18T03:17:08Z)
Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling [15.132926378740882]
We propose a novel codebook transfer framework with part-of-speech, called VQCT, which aims to transfer a well-trained codebook from pretrained language models to VQIM. Experimental results on four datasets show that our VQCT method achieves superior VQIM performance over previous state-of-the-art methods.
arXiv Detail & Related papers (2024-03-15T07:24:13Z)
Soft Convex Quantization: Revisiting Vector Quantization with Convex Optimization [40.1651740183975]
We propose Soft Convex Quantization (SCQ) as a direct substitute for Vector Quantization (VQ) SCQ works like a differentiable convex optimization (DCO) layer. We demonstrate its efficacy on the CIFAR-10, GTSRB and LSUN datasets.
arXiv Detail & Related papers (2023-10-04T17:45:14Z)
Dual Associated Encoder for Face Restoration [68.49568459672076]
We propose a novel dual-branch framework named DAEFR to restore facial details from low-quality (LQ) images. Our method introduces an auxiliary LQ branch that extracts crucial information from the LQ inputs. We evaluate the effectiveness of DAEFR on both synthetic and real-world datasets.
arXiv Detail & Related papers (2023-08-14T17:58:33Z)
Online Clustered Codebook [100.1650001618827]
We present a simple alternative method for online codebook learning, Clustering VQ-VAE (CVQ-VAE) Our approach selects encoded features as anchors to update the dead'' codevectors, while optimising the codebooks which are alive via the original loss. Our CVQ-VAE can be easily integrated into the existing models with just a few lines of code.
arXiv Detail & Related papers (2023-07-27T18:31:04Z)
Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization [73.52943587514386]
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm. We propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE) which encodes image regions into variable-length codes based their information densities for accurate representation.
arXiv Detail & Related papers (2023-05-19T14:56:05Z)
VQFR: Blind Face Restoration with Vector-Quantized Dictionary and Parallel Decoder [83.63843671885716]
We propose a VQ-based face restoration method -- VQFR. VQFR takes advantage of high-quality low-level feature banks extracted from high-quality faces. To further fuse low-level features from inputs while not "contaminating" the realistic details generated from the VQ codebook, we proposed a parallel decoder.
arXiv Detail & Related papers (2022-05-13T17:54:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.