Finite Scalar Quantization: VQ-VAE Made Simple
- URL: http://arxiv.org/abs/2309.15505v2
- Date: Thu, 12 Oct 2023 07:55:05 GMT
- Title: Finite Scalar Quantization: VQ-VAE Made Simple
- Authors: Fabian Mentzer, David Minnen, Eirikur Agustsson, Michael Tschannen
- Abstract summary: We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ)
By appropriately choosing the number of dimensions and values each dimension can take, we obtain the same codebook size as in VQ.
We employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation.
- Score: 26.351016719675766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose to replace vector quantization (VQ) in the latent representation
of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where
we project the VAE representation down to a few dimensions (typically less than
10). Each dimension is quantized to a small set of fixed values, leading to an
(implicit) codebook given by the product of these sets. By appropriately
choosing the number of dimensions and values each dimension can take, we obtain
the same codebook size as in VQ. On top of such discrete representations, we
can train the same models that have been trained on VQ-VAE representations. For
example, autoregressive and masked transformer models for image generation,
multimodal generation, and dense prediction computer vision tasks. Concretely,
we employ FSQ with MaskGIT for image generation, and with UViM for depth
estimation, colorization, and panoptic segmentation. Despite the much simpler
design of FSQ, we obtain competitive performance in all these tasks. We
emphasize that FSQ does not suffer from codebook collapse and does not need the
complex machinery employed in VQ (commitment losses, codebook reseeding, code
splitting, entropy penalties, etc.) to learn expressive discrete
representations.
Related papers
- Factorized Visual Tokenization and Generation [37.56136469262736]
We introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks.
This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization.
Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-11-25T18:59:53Z) - LASERS: LAtent Space Encoding for Representations with Sparsity for Generative Modeling [3.9426000822656224]
We show that our more latent space is more expressive and has leads to better representations than the Vector Quantization approach.
Our results thus suggest that the true benefit of the VQ approach might not be from discretization of the latent space, but rather the lossy compression of the latent space.
arXiv Detail & Related papers (2024-09-16T08:20:58Z) - HyperVQ: MLR-based Vector Quantization in Hyperbolic Space [56.4245885674567]
We study the use of hyperbolic spaces for vector quantization (HyperVQ)
We show that hyperVQ performs comparably in reconstruction and generative tasks while outperforming VQ in discriminative tasks and learning a highly disentangled latent space.
arXiv Detail & Related papers (2024-03-18T03:17:08Z) - Soft Convex Quantization: Revisiting Vector Quantization with Convex
Optimization [40.1651740183975]
We propose Soft Convex Quantization (SCQ) as a direct substitute for Vector Quantization (VQ)
SCQ works like a differentiable convex optimization (DCO) layer.
We demonstrate its efficacy on the CIFAR-10, GTSRB and LSUN datasets.
arXiv Detail & Related papers (2023-10-04T17:45:14Z) - Online Clustered Codebook [100.1650001618827]
We present a simple alternative method for online codebook learning, Clustering VQ-VAE (CVQ-VAE)
Our approach selects encoded features as anchors to update the dead'' codevectors, while optimising the codebooks which are alive via the original loss.
Our CVQ-VAE can be easily integrated into the existing models with just a few lines of code.
arXiv Detail & Related papers (2023-07-27T18:31:04Z) - Not All Image Regions Matter: Masked Vector Quantization for
Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework, which consists of Masked Quantization VAE (MQ-VAE) Stack model from modeling redundancy.
arXiv Detail & Related papers (2023-05-23T02:15:53Z) - Vector Quantized Wasserstein Auto-Encoder [57.29764749855623]
We study learning deep discrete representations from the generative viewpoint.
We endow discrete distributions over sequences of codewords and learn a deterministic decoder that transports the distribution over the sequences of codewords to the data distribution.
We develop further theories to connect it with the clustering viewpoint of WS distance, allowing us to have a better and more controllable clustering solution.
arXiv Detail & Related papers (2023-02-12T13:51:36Z) - SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed
Stochastic Quantization [13.075574481614478]
One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook.
We propose a new training scheme that extends the standard VAE via novel dequantization and quantization.
Our experiments show that SQ-VAE improves codebook utilization without using commons.
arXiv Detail & Related papers (2022-05-16T09:49:37Z) - VQFR: Blind Face Restoration with Vector-Quantized Dictionary and
Parallel Decoder [83.63843671885716]
We propose a VQ-based face restoration method -- VQFR.
VQFR takes advantage of high-quality low-level feature banks extracted from high-quality faces.
To further fuse low-level features from inputs while not "contaminating" the realistic details generated from the VQ codebook, we proposed a parallel decoder.
arXiv Detail & Related papers (2022-05-13T17:54:40Z) - Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
It brings a great benefit by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.