Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%
- URL: http://arxiv.org/abs/2406.11837v1
- Date: Mon, 17 Jun 2024 17:59:57 GMT
- Title: Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%
- Authors: Lei Zhu, Fangyun Wei, Yanye Lu, Dong Chen
- Abstract summary: We propose a novel image quantization model named VQGAN-LC (Large Codebook), which extends the codebook size to 100,000, achieving a utilization rate exceeding 99%.
We demonstrate the superior performance of our model over its counterparts across a variety of tasks, including image reconstruction, image classification, auto-regressive image generation using GPT, and image creation with diffusion- and flow-based generative models.
- Score: 35.710953589794855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the realm of image quantization exemplified by VQGAN, the process encodes images into discrete tokens drawn from a codebook with a predefined size. Recent advancements, particularly with LLaMA 3, reveal that enlarging the codebook significantly enhances model performance. However, VQGAN and its derivatives, such as VQGAN-FC (Factorized Codes) and VQGAN-EMA, continue to grapple with challenges related to expanding the codebook size and enhancing codebook utilization. For instance, VQGAN-FC is restricted to learning a codebook with a maximum size of 16,384, maintaining a typically low utilization rate of less than 12% on ImageNet. In this work, we propose a novel image quantization model named VQGAN-LC (Large Codebook), which extends the codebook size to 100,000, achieving a utilization rate exceeding 99%. Unlike previous methods that optimize each codebook entry, our approach begins with a codebook initialized with 100,000 features extracted by a pre-trained vision encoder. Optimization then focuses on training a projector that aligns the entire codebook with the feature distributions of the encoder in VQGAN-LC. We demonstrate the superior performance of our model over its counterparts across a variety of tasks, including image reconstruction, image classification, auto-regressive image generation using GPT, and image creation with diffusion- and flow-based generative models. Code and models are available at https://github.com/zh460045050/VQGAN-LC.
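The abstract above spells out the core mechanism: a codebook frozen at initialization (100,000 features from a pre-trained vision encoder) and a trainable projector that maps the whole codebook into the encoder's latent space, where nearest-neighbor quantization happens. Below is a minimal PyTorch sketch of that quantization step; the class name, shapes, and the straight-through gradient trick are illustrative assumptions, and the usual VQGAN auxiliary losses are omitted.

```python
import torch
import torch.nn as nn

class FrozenCodebookQuantizer(nn.Module):
    """Quantizer with a frozen codebook and a trainable projector (sketch)."""

    def __init__(self, init_features: torch.Tensor, latent_dim: int):
        super().__init__()
        # Codebook entries stay fixed; only the projector is optimized.
        self.register_buffer("codebook", init_features)      # (K=100_000, feat_dim)
        self.projector = nn.Linear(init_features.shape[1], latent_dim)

    def forward(self, z: torch.Tensor):
        # z: encoder features, shape (B, N, latent_dim).
        code = self.projector(self.codebook)                  # (K, latent_dim)
        dists = torch.cdist(z.reshape(-1, z.size(-1)), code)  # (B*N, K)
        idx = dists.argmin(dim=-1).view(z.size(0), z.size(1))
        z_q = code[idx]                                       # (B, N, latent_dim)
        # Straight-through estimator so gradients still reach the encoder.
        return z + (z_q - z).detach(), idx
```

Because the entries themselves are never updated, every codeword remains reachable through the projector, which is consistent with the near-99% utilization the paper reports.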
Related papers
- Image Understanding Makes for A Good Tokenizer for Image Generation [62.875788091204626]
We introduce a token-based image generation (IG) framework, which relies on effective tokenizers to project images into token sequences.
We show that tokenizers with strong image understanding (IU) capabilities achieve superior IG performance across a variety of metrics, datasets, tasks, and proposal networks.
arXiv Detail & Related papers (2024-11-07T03:55:23Z)
- LG-VQ: Language-Guided Codebook Learning [36.422599206253324]
Vector quantization (VQ) is a key technique in high-resolution and high-fidelity image synthesis.
We propose a novel language-guided codebook learning framework, called LG-VQ.
Our method achieves superior performance on reconstruction and various multi-modal downstream tasks.
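As a rough illustration of what "language-guided" codebook learning can mean, the toy loss below pulls the codebook entries selected for an image toward the embedding of its caption. The pairing scheme, loss form, and function names are assumptions for illustration, not LG-VQ's actual objectives.

```python
import torch
import torch.nn.functional as F

def language_guidance_loss(selected_codes: torch.Tensor,
                           caption_emb: torch.Tensor) -> torch.Tensor:
    """selected_codes: (M, D) codebook entries used for one image;
    caption_emb: (D,) text embedding of that image's caption."""
    codes = F.normalize(selected_codes, dim=-1)
    text = F.normalize(caption_emb, dim=-1)
    # Cosine loss: encourage the selected codes to lie near the caption embedding.
    return (1.0 - codes @ text).mean()
```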
arXiv Detail & Related papers (2024-05-23T06:04:40Z)
- Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling [15.132926378740882]
We propose a novel codebook transfer framework with part-of-speech, called VQCT, which aims to transfer a well-trained codebook from pretrained language models to VQIM.
Experimental results on four datasets show that our VQCT method achieves superior VQIM performance over previous state-of-the-art methods.
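A toy sketch of the transfer idea described above: initialize the visual codebook from pre-trained word embeddings, filtered by part-of-speech. The POS filter, tag set, and shapes are illustrative assumptions rather than VQCT's exact procedure.

```python
import torch

def transfer_codebook(word_emb: torch.Tensor,
                      pos_tags: list[str],
                      keep: tuple[str, ...] = ("NOUN", "ADJ")) -> torch.Tensor:
    """word_emb: (V, D) pre-trained word embeddings; pos_tags: POS tag per word.
    Returns a (K, D) codebook built from content words only."""
    idx = [i for i, tag in enumerate(pos_tags) if tag in keep]
    return word_emb[torch.tensor(idx, dtype=torch.long)]
```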
arXiv Detail & Related papers (2024-03-15T07:24:13Z)
- Online Clustered Codebook [100.1650001618827]
We present a simple alternative method for online codebook learning, Clustering VQ-VAE (CVQ-VAE).
Our approach selects encoded features as anchors to update the "dead" codevectors, while optimising the codebooks which are alive via the original loss.
Our CVQ-VAE can be easily integrated into the existing models with just a few lines of code.
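The summary describes the key move: re-anchor "dead" codevectors to recently encoded features while live entries keep training with the usual VQ loss. A hedged sketch of that update follows; the usage statistic, threshold, and random anchor choice are assumptions, not CVQ-VAE's exact rule.

```python
import torch

@torch.no_grad()
def revive_dead_codes(codebook: torch.Tensor,
                      usage_ema: torch.Tensor,
                      encoded: torch.Tensor,
                      threshold: float = 1e-3) -> torch.Tensor:
    """codebook: (K, D); usage_ema: (K,) running selection frequency;
    encoded: (M, D) flattened encoder features from the current batch."""
    dead = usage_ema < threshold                    # entries nobody selects
    n_dead = int(dead.sum())
    if n_dead > 0:
        # Re-anchor each dead entry to a randomly chosen encoded feature.
        anchors = encoded[torch.randint(0, encoded.size(0), (n_dead,))]
        codebook[dead] = anchors
    return codebook
```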
arXiv Detail & Related papers (2023-07-27T18:31:04Z)
- Learning Image-Adaptive Codebooks for Class-Agnostic Image Restoration [13.718779033187786]
We propose AdaCode for learning image-adaptive codebooks for class-agnostic image restoration.
AdaCode is a more flexible and expressive discrete generative prior than previous work.
arXiv Detail & Related papers (2023-06-10T19:32:47Z)
- Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework consisting of a Masked Quantization VAE (MQ-VAE), which masks out unimportant image regions, and a Stackformer, relieving the model from modeling redundancy.
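As a rough sketch of "masked quantization": score each latent region, keep only the most informative fraction for quantization, and drop the rest. The linear scorer and fixed keep-ratio below are illustrative assumptions, not the paper's adaptive design.

```python
import torch
import torch.nn as nn

class RegionMasker(nn.Module):
    """Keep only the top-scoring latent regions before quantization (sketch)."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.score = nn.Linear(dim, 1)       # learned importance per region
        self.keep_ratio = keep_ratio

    def forward(self, z: torch.Tensor):
        # z: (B, N, D) latent regions; retain the top keep_ratio fraction.
        scores = self.score(z).squeeze(-1)                      # (B, N)
        k = max(1, int(self.keep_ratio * z.size(1)))
        keep_idx = scores.topk(k, dim=1).indices                # (B, k)
        gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, z.size(-1))
        z_kept = torch.gather(z, 1, gather_idx)                 # (B, k, D)
        return z_kept, keep_idx   # quantize z_kept downstream
```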
arXiv Detail & Related papers (2023-05-23T02:15:53Z)
- Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization [73.52943587514386]
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm.
We propose a novel two-stage framework whose first stage, a Dynamic-Quantization VAE (DQ-VAE), encodes image regions into variable-length codes based on their information densities for accurate representation.
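A toy illustration of density-aware code allocation: regions whose content is "busier" (here crudely proxied by feature variance) receive the longer, finer code. The variance proxy and the two-level split are assumptions for illustration, not the paper's actual criterion.

```python
import torch

def assign_fine_codes(regions: torch.Tensor, fine_budget: int) -> torch.Tensor:
    """regions: (N, D) region features. Returns a bool mask marking regions
    that receive the fine (longer) code; the rest get the coarse code."""
    density = regions.var(dim=1)                 # crude information-density proxy
    fine = torch.zeros(regions.size(0), dtype=torch.bool)
    fine[density.topk(fine_budget).indices] = True
    return fine
```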
arXiv Detail & Related papers (2023-05-19T14:56:05Z)
- FewGAN: Generating from the Joint Distribution of a Few Images [95.6635227371479]
We introduce FewGAN, a generative model for generating novel, high-quality and diverse images.
FewGAN is a hierarchical patch-GAN that applies quantization at the first coarse scale, followed by a pyramid of residual fully convolutional GANs at finer scales.
In an extensive set of experiments, it is shown that FewGAN outperforms baselines both quantitatively and qualitatively.
arXiv Detail & Related papers (2022-07-18T07:11:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.