Post-training quantization of vision encoders needs prefixing registers
- URL: http://arxiv.org/abs/2510.04547v2
- Date: Fri, 10 Oct 2025 06:33:23 GMT
- Title: Post-training quantization of vision encoders needs prefixing registers
- Authors: Seunghyeon Kim, Jinho Kim, Taesun Yeom, Wonpyo Park, Kyuyeun Kim, Jaeho Lee,
- Abstract summary: We propose a training-free algorithm to mitigate outliers in vision encoders.<n>The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder.<n>Our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.
- Score: 9.5677950128084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based vision encoders -- such as CLIP -- are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Post-training quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose $\textit{RegCache}$, a training-free algorithm to mitigate outliers in vision encoders, enabling quantization with significantly smaller accuracy drops. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.
Related papers
- On the Adversarial Robustness of Discrete Image Tokenizers [56.377796750281796]
We first formulate attacks that aim to perturb the features extracted by discrete tokenizers, and thus change the extracted tokens.<n>We fine-tune popular tokenizers with unsupervised adversarial training, keeping all other components frozen.<n>Our approach significantly improves robustness to both unsupervised and end-to-end supervised attacks and generalizes well to unseen tasks and data.
arXiv Detail & Related papers (2026-02-20T14:39:17Z) - OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence [113.73007911004446]
OneVision-Encoder encodes video by compressing visual structure into semantic meaning.<n>Codec-aligned, patch-level sparsity is a foundational principle, enabling OV-Encoder as a scalable engine for next-generation visual generalists.
arXiv Detail & Related papers (2026-02-09T14:06:17Z) - METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models [92.37117312251755]
We propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR)<n>For multi-vision encoding, we discard redundant tokens within each encoder via a rank guided collaborative token assignment strategy.<n>For multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning.
arXiv Detail & Related papers (2025-07-28T13:50:53Z) - ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models [59.47738955960352]
ToDRE is a two-stage and training-free token compression framework.<n>It achieves superior performance by pruning tokens based on token Diversity and token-task RElevance.
arXiv Detail & Related papers (2025-05-24T15:47:49Z) - Unveiling Encoder-Free Vision-Language Models [62.52803514667452]
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks.
We bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs.
We launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently.
arXiv Detail & Related papers (2024-06-17T17:59:44Z) - Regress Before Construct: Regress Autoencoder for Point Cloud
Self-supervised Learning [18.10704604275133]
Masked Autoencoders (MAE) have demonstrated promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Point Regress AutoEncoder (Point-RAE), a new scheme for regressive autoencoders for point cloud self-supervised learning.
Our approach is efficient during pre-training and generalizes well on various downstream tasks.
arXiv Detail & Related papers (2023-09-25T17:23:33Z) - Self-Supervised Point Cloud Representation Learning with Occlusion
Auto-Encoder [63.77257588569852]
We present 3D Occlusion Auto-Encoder (3D-OAE) for learning representations for point clouds.
Our key idea is to randomly occlude some local patches of the input point cloud and establish the supervision via recovering the occluded patches.
In contrast with previous methods, our 3D-OAE can remove a large proportion of patches and predict them only with a small number of visible patches.
arXiv Detail & Related papers (2022-03-26T14:06:29Z) - Sparsity and Sentence Structure in Encoder-Decoder Attention of
Summarization Systems [38.672160430296536]
Transformer models have achieved state-of-the-art results in a wide range of NLP tasks including summarization.
Previous work has focused on one important bottleneck, the quadratic self-attention mechanism in the encoder.
This work focuses on the transformer's encoder-decoder attention mechanism.
arXiv Detail & Related papers (2021-09-08T19:32:42Z) - A manifold learning perspective on representation learning: Learning
decoder and representations without an encoder [0.0]
Autoencoders are commonly used in representation learning.
Inspired by manifold learning, we show that the decoder can be trained on its own by learning the representations of the training samples.
Our approach of training the decoder alone facilitates representation learning even on small data sets.
arXiv Detail & Related papers (2021-08-31T15:08:50Z) - Dynamic Neural Representational Decoders for High-Resolution Semantic
Segmentation [98.05643473345474]
We propose a novel decoder, termed dynamic neural representational decoder (NRD)
As each location on the encoder's output corresponds to a local patch of the semantic labels, in this work, we represent these local patches of labels with compact neural networks.
This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient.
arXiv Detail & Related papers (2021-07-30T04:50:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.