Masked Vector Quantization
- URL: http://arxiv.org/abs/2301.06626v2
- Date: Mon, 25 Mar 2024 00:45:30 GMT
- Title: Masked Vector Quantization
- Authors: David D. Nguyen, David Leibowitz, Surya Nepal, Salil S. Kanhere
- Abstract summary: Generative models with discrete latent representations have recently demonstrated an impressive ability to learn complex data distributions.
We propose the Masked Vector Quantization (MVQ) framework which increases the representational capacity of each code vector by learning mask configurations.
MVQ reduces FID in existing vector quantization architectures by up to $68\%$ at 2 tokens per instance and $57\%$ at 5 tokens.
- Score: 19.858406923144404
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Generative models with discrete latent representations have recently demonstrated an impressive ability to learn complex high-dimensional data distributions. However, their performance relies on a long sequence of tokens per instance and a large number of codebook entries, resulting in long sampling times and considerable computation to fit the categorical posterior. To address these issues, we propose the Masked Vector Quantization (MVQ) framework, which increases the representational capacity of each code vector by learning mask configurations via a stochastic winner-takes-all training regime called Multiple Hypothesis Dropout (MH-Dropout). On ImageNet 64$\times$64, MVQ reduces FID in existing vector quantization architectures by up to $68\%$ at 2 tokens per instance and $57\%$ at 5 tokens. These improvements widen as the number of codebook entries is reduced, and allow for a $7\textit{--}45\times$ speed-up in token sampling during inference. As an additional benefit, we find that smaller latent spaces lead MVQ to identify transferable visual representations, multiple of which can be smoothly combined.
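As a rough illustration of the masked-codebook idea, the sketch below applies a set of binary masks elementwise to each code vector before nearest-neighbor lookup, so K codes and M masks yield K*M effective vectors. All names and shapes are illustrative assumptions; learning the masks via MH-Dropout's winner-takes-all regime is not shown.

```python
import torch

def masked_vq_lookup(z, codebook, masks):
    # z: (B, D) encoder outputs; codebook: (K, D); masks: (M, D) binary.
    # Every (code, mask) pair acts as an effective code vector, so K codes
    # and M masks represent K*M distinct vectors from K code embeddings.
    B, D = z.shape
    effective = (codebook.unsqueeze(1) * masks.unsqueeze(0)).reshape(-1, D)  # (K*M, D)
    idx = torch.cdist(z, effective).argmin(dim=-1)                           # (B,)
    z_q = effective[idx]
    # Straight-through estimator: copy gradients from z_q back to z.
    return z + (z_q - z).detach(), idx
```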
Related papers
- When LLaVA Meets Objects: Token Composition for Vision-Language-Models [31.554057603168214]
Mask-LLaVA is a framework that leverages different levels of visual features to create a compact yet information-rich visual representation for autoregressive Vision Language Models. Although all tokens are used during training, the resulting model can flexibly drop tokens at test time, especially the mask-based object tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time with good performance.
arXiv Detail & Related papers (2026-02-04T18:50:46Z) - SAMTok: Representing Any Mask with Two Words [70.74140779649856]
We present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens. By treating masks as new language tokens, SAMTok enables base MLLMs to learn pixel-wise capabilities. QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation.
arXiv Detail & Related papers (2026-01-22T16:44:09Z) - SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs [59.415473779171315]
We propose a novel visual token pruning strategy called Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE).
arXiv Detail & Related papers (2025-10-28T09:29:37Z) - Partition Generative Modeling: Masked Modeling Without Masks [10.751153162476726]
Masked generative models (MGMs) are widely used to capture complex data and enable faster generation than autoregressive (AR) models. In this work, we introduce the Partition Generative Model (PGM), a novel approach that combines the strengths of AR models and MGMs. On OpenWebText, PGMs offer at least $5\times$ improvements in sampling latency and throughput, while producing samples with superior generative perplexity.
arXiv Detail & Related papers (2025-05-24T21:44:32Z) - ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers [70.38258823378557]
Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. We introduce a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens. We propose ShortV, a training-free method that leverages LC to identify ineffective layers and freezes visual token updates in these layers.
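A minimal sketch of a Layer-Contribution-style score, assuming `logits_ablated` comes from a forward pass in which the candidate layer copies visual hidden states through unchanged; the exact ablation and metric in ShortV may differ.

```python
import torch
import torch.nn.functional as F

def layer_contribution(logits_full, logits_ablated):
    # Divergence between the output distributions with and without the
    # candidate layer updating visual tokens. A small value marks the
    # layer as "ineffective" for visual tokens, so it can be frozen.
    p = F.log_softmax(logits_full, dim=-1)
    q = F.log_softmax(logits_ablated, dim=-1)
    return F.kl_div(q, p, log_target=True, reduction="batchmean")
```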
arXiv Detail & Related papers (2025-04-01T07:47:55Z) - Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models [50.214593234229255]
We introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens.<n>On the Extreme Short Token Reduction task, our VQToken compresses sequences to just 0.07 percent of their original length while incurring only a 0.66 percent drop in accuracy on the NextQA-MC benchmark.
arXiv Detail & Related papers (2025-03-21T09:46:31Z) - DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models [13.519389777060226]
Adding visual tokens to Large Multimodal Models (LMMs) increases the total token count, often by thousands.
To address this issue, token pruning methods that remove a subset of the visual tokens have been proposed.
The proposed method, DivPrune, reduces redundancy and achieves the highest diversity of the selected tokens.
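A hedged sketch of the diversity objective described above, framed as greedy max-min selection over token embeddings (illustrative, not necessarily DivPrune's exact algorithm):

```python
import torch

def divprune_select(tokens, k):
    # tokens: (N, D) visual token embeddings; keep k tokens whose minimum
    # pairwise cosine distance is greedily maximized (max-min diversity).
    x = torch.nn.functional.normalize(tokens, dim=-1)
    dist = 1.0 - x @ x.T                 # (N, N) cosine distances
    selected = [0]                       # seed with an arbitrary token
    min_d = dist[0].clone()              # distance to nearest selected token
    for _ in range(k - 1):
        min_d[selected] = -1.0           # never re-pick selected tokens
        nxt = int(min_d.argmax())        # farthest-from-set token
        selected.append(nxt)
        min_d = torch.minimum(min_d, dist[nxt])
    return torch.tensor(selected)
```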
arXiv Detail & Related papers (2025-03-04T01:33:14Z) - UniTok: A Unified Tokenizer for Visual Generation and Understanding [69.09699034036124]
We introduce UniTok, a discrete visual tokenizer that encodes fine-grained details for generation while also capturing high-level semantics for understanding.
Our method significantly raises the upper limit of unified discrete tokenizers to match or even surpass domain-specific continuous tokenizers.
arXiv Detail & Related papers (2025-02-27T17:47:01Z) - Scalable Image Tokenization with Index Backpropagation Quantization [74.15447383432262]
Index Backpropagation Quantization (IBQ) is a new VQ method for the joint optimization of all codebook embeddings and the visual encoder. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook with high dimension ($256$) and high utilization.
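A minimal sketch of an index-backpropagation-style quantizer: a straight-through trick on the one-hot index lets gradients reach every codebook entry rather than only the selected one. The names, temperature `tau`, and dot-product similarity are assumptions, not the paper's exact formulation.

```python
import torch

def ibq_quantize(z, codebook, tau=1.0):
    # z: (B, D) encoder outputs; codebook: (K, D) codebook embeddings.
    logits = z @ codebook.T                     # (B, K) similarities
    soft = torch.softmax(logits / tau, dim=-1)  # gradient path to every entry
    idx = logits.argmax(dim=-1)
    hard = torch.nn.functional.one_hot(idx, codebook.shape[0]).to(soft.dtype)
    onehot = hard + soft - soft.detach()        # straight-through one-hot
    z_q = onehot @ codebook                     # (B, D) quantized output
    return z_q, idx
```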
arXiv Detail & Related papers (2024-12-03T18:59:10Z) - Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling [53.58854856174773]
Speculative decoding is an approach to accelerate inference through a guess-and-verify paradigm.
Token Recycling stores candidate tokens in an adjacency matrix and employs a breadth-first search algorithm.
It significantly outperforms existing training-free methods by 30% and even a training-based method by 25%.
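A toy sketch of the recycling idea, assuming `adj[t]` stores previously observed candidate successors of token t; a breadth-first search expands a draft tree that the target model then verifies in one forward pass (the drafting and verification steps are omitted).

```python
from collections import deque

def build_draft_tree(adj, root, max_nodes=32):
    # BFS over the token adjacency structure: each node's children are
    # candidate next tokens recycled from earlier decoding steps.
    tree = {root: []}
    queue = deque([root])
    while queue and len(tree) < max_nodes:
        t = queue.popleft()
        for nxt in adj.get(t, []):
            if nxt not in tree:
                tree[t].append(nxt)
                tree[nxt] = []
                queue.append(nxt)
    return tree  # token -> children, to be verified by the target model

# Hypothetical adjacency built from recycled candidate tokens.
adj = {1: [5, 9], 5: [2, 7], 9: [4], 7: [3]}
print(build_draft_tree(adj, root=1))
```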
arXiv Detail & Related papers (2024-08-16T12:20:56Z) - Matryoshka Query Transformer for Large Vision-Language Models [103.84600181927884]
We introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference.
We train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens.
Our model, MQT-LLAVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576.
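A hedged sketch of Matryoshka-style query selection: during training the number of latent query tokens is sampled at random so that every prefix length remains usable at inference. Shapes and the sampling range are assumptions.

```python
import random
import torch

def matryoshka_queries(queries, training=True, m=None):
    # queries: (M_max, D) learnable query tokens. Keep only the first m;
    # m is randomized per training step so all prefix lengths are trained.
    M_max = queries.shape[0]
    if training:
        m = random.randint(2, M_max)   # random prefix length per step
    m = m or M_max
    return queries[:m]                 # image features cross-attend to these

q = torch.randn(256, 1024)                                # up to 256 tokens
print(matryoshka_queries(q, training=False, m=64).shape)  # torch.Size([64, 1024])
```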
arXiv Detail & Related papers (2024-05-29T17:39:42Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models [35.88374542519597]
Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model.
Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly.
We propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs.
arXiv Detail & Related papers (2024-03-22T17:59:52Z) - CITADEL: Conditional Token Interaction via Dynamic Lexical Routing for Efficient and Effective Multi-Vector Retrieval [72.90850213615427]
Multi-vector retrieval methods combine the merits of sparse (e.g. BM25) and dense (e.g. DPR) retrievers.
These methods are orders of magnitude slower and need much more space to store their indices compared to their single-vector counterparts.
We propose conditional token interaction via dynamic lexical routing, namely CITADEL, for efficient and effective multi-vector retrieval.
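A rough sketch of the dynamic lexical routing idea described above: each contextual token embedding is routed to a few lexical keys, and search-time interaction is restricted to tokens sharing a key. `router` and `top_k` are illustrative placeholders, not CITADEL's exact architecture.

```python
import torch

def lexical_route(token_embs, router, top_k=1):
    # Score each contextual token embedding against a lexical vocabulary
    # and keep only its top-k keys; query tokens then interact only with
    # passage tokens routed to the same key (an inverted index per key).
    scores = router(token_embs)                # (N, V) routing logits
    weights, keys = scores.topk(top_k, dim=-1)
    return keys, torch.relu(weights)           # nonnegative routing weights
```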
arXiv Detail & Related papers (2022-11-18T18:27:35Z) - AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders [44.87786478095987]
Masked Autoencoders learn general representations for image, text, audio, video, etc., by reconstructing masked input data from tokens of the visible data.
This paper proposes an adaptive masking strategy for MAEs that is end-to-end trainable.
AdaMAE samples visible tokens based on the semantic context using an auxiliary sampling network.
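A minimal sketch of the adaptive sampling step, assuming `policy_net` is any module mapping token features to per-token logits; the objective that trains the sampler end-to-end is omitted.

```python
import torch

def adaptive_visible_sampling(token_feats, policy_net, n_visible):
    # token_feats: (N, D). The auxiliary network scores every token, and
    # visible tokens are *sampled* from the resulting categorical
    # distribution rather than chosen uniformly at random.
    logits = policy_net(token_feats).squeeze(-1)       # (N,)
    probs = torch.softmax(logits, dim=-1)
    visible_idx = torch.multinomial(probs, n_visible)  # sample w/o replacement
    return visible_idx, probs                          # probs feed the policy loss
```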
arXiv Detail & Related papers (2022-11-16T18:59:48Z) - Extreme Masking for Learning Instance and Distributed Visual Representations [50.152264456036114]
The paper presents a scalable approach for learning distributed representations over individual tokens and a holistic instance representation simultaneously.
We use self-attention blocks to represent distributed tokens, followed by cross-attention blocks to aggregate the holistic instance.
Our model, named ExtreMA, follows the plain BYOL approach where the instance representation from the unmasked subset is trained to predict that from the intact input.
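A hedged sketch of one training step under the BYOL-style objective described above; `online`, `target` (a momentum encoder), and `predictor` are user-supplied modules, and the choice of visible subset is an assumption.

```python
import torch
import torch.nn.functional as F

def extrema_step(online, target, predictor, x, visible_idx):
    # x: (B, N, D) token sequence. The instance embedding of an extremely
    # masked view predicts the stop-gradient embedding of the intact input.
    z_masked = online(x[:, visible_idx])   # encode the unmasked token subset
    with torch.no_grad():
        z_full = target(x)                 # momentum encoder, intact input
    p = predictor(z_masked)
    # Negative cosine similarity, as in BYOL.
    return -(F.normalize(p, dim=-1) * F.normalize(z_full, dim=-1)).sum(-1).mean()
```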
arXiv Detail & Related papers (2022-06-09T17:59:43Z) - PointINS: Point-based Instance Segmentation [117.38579097923052]
Mask representation in instance segmentation with Point-of-Interest (PoI) features is challenging because learning a high-dimensional mask feature for each instance imposes a heavy computational burden.
We propose an instance-aware convolution, which decomposes this mask representation learning task into two tractable modules.
Along with instance-aware convolution, we propose PointINS, a simple and practical instance segmentation approach.
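A toy sketch of an instance-aware convolution: a small head (here the hypothetical `weight_gen`) turns each Point-of-Interest feature into per-instance convolution weights, which are applied to a shared mask feature map.

```python
import torch
import torch.nn.functional as F

def instance_aware_conv(feat, poi_feat, weight_gen, c_out=1, k=1):
    # feat: (1, C, H, W) shared mask features; poi_feat: (D,) one PoI
    # feature; weight_gen maps (D,) -> (c_out * C * k * k,) dynamic weights.
    c_in = feat.shape[1]
    w = weight_gen(poi_feat).view(c_out, c_in, k, k)  # per-instance kernel
    return F.conv2d(feat, w, padding=k // 2)          # instance mask logits
```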
arXiv Detail & Related papers (2020-03-13T08:24:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.