Dimension vs. Precision: A Comparative Analysis of Autoencoders and Quantization for Efficient Vector Retrieval on BEIR SciFact
- URL: http://arxiv.org/abs/2511.13057v2
- Date: Tue, 18 Nov 2025 16:07:31 GMT
- Title: Dimension vs. Precision: A Comparative Analysis of Autoencoders and Quantization for Efficient Vector Retrieval on BEIR SciFact
- Authors: Satyanarayan Pati
- Abstract summary: Int8 quantization provides the most effective "sweet spot," achieving a 4x compression with a negligible (~1-2%) drop in nDCG@10. Autoencoders show a graceful degradation but suffer a more significant performance loss at equivalent 4x compression ratios. Binary quantization was found to be unsuitable for this task due to catastrophic performance drops.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dense retrieval models have become a standard for state-of-the-art information retrieval. However, their high-dimensional, high-precision (float32) vector embeddings create significant storage and memory challenges for real-world deployment. To address this, we conduct a rigorous empirical study on the BEIR SciFact benchmark, evaluating the trade-offs between two primary compression strategies: (1) Dimensionality Reduction via deep Autoencoders (AE), reducing original 384-dim vectors to latent spaces from 384 down to 12, and (2) Precision Reduction via Quantization (float16, int8, and binary). We systematically compare each method by measuring the "performance loss" (or gain) relative to a float32 baseline across a full suite of retrieval metrics (NDCG, MAP, MRR, Recall, Precision) at various k cutoffs. Our results show that int8 scalar quantization provides the most effective "sweet spot," achieving a 4x compression with a negligible (~1-2%) drop in nDCG@10. In contrast, Autoencoders show a graceful degradation but suffer a more significant performance loss at equivalent 4x compression ratios (AE-96). Binary quantization was found to be unsuitable for this task due to catastrophic performance drops. This work provides a practical guide for deploying efficient, high-performance retrieval systems.
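The int8 scalar quantization described in the abstract can be sketched as below. This is a minimal illustrative example on synthetic vectors, not the paper's exact calibration or BEIR evaluation pipeline; the per-dimension symmetric scaling scheme is one common choice and is assumed here, not taken from the paper.

```python
# Sketch: int8 scalar quantization of float32 retrieval embeddings.
# Synthetic data stands in for the 384-dim SciFact embeddings.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 384)).astype(np.float32)   # float32 corpus
query = rng.standard_normal(384).astype(np.float32)

# Per-dimension symmetric scaling into the int8 range [-127, 127]
# (one possible calibration; the paper does not specify its scheme).
scale = np.abs(docs).max(axis=0) / 127.0
docs_i8 = np.clip(np.round(docs / scale), -127, 127).astype(np.int8)

# Dequantize at search time (alternatively, fold `scale` into the query).
docs_deq = docs_i8.astype(np.float32) * scale

# Compare top-10 retrieval under dot-product scoring.
top_f32 = np.argsort(docs @ query)[::-1][:10]
top_i8 = np.argsort(docs_deq @ query)[::-1][:10]

compression = docs.nbytes / docs_i8.nbytes          # 4 bytes -> 1 byte
overlap = len(set(top_f32) & set(top_i8)) / 10
print(f"compression: {compression:.0f}x, top-10 overlap: {overlap:.0%}")
```

The 4x figure follows directly from storing one byte per dimension instead of four; the top-k overlap on this toy data illustrates why the retrieval-quality drop is small, though the paper measures it properly with nDCG@10 on SciFact.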
Related papers
- LACONIC: Dense-Level Effectiveness for Scalable Sparse Retrieval via a Two-Phase Training Curriculum [73.82125917416067]
LACONIC is a family of learned sparse retrievers based on the Llama-3 architecture. The 8B variant achieves a state-of-the-art 60.2 nDCG on the MTEB Retrieval benchmark, ranking 15th on the leaderboard.
arXiv Detail & Related papers (2026-01-04T22:42:20Z)
- ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization [99.96330641363396]
ARMOR (Adaptive Representation with Matrix-factORization) is a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block-diagonal matrices. We demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations.
arXiv Detail & Related papers (2025-10-07T02:39:20Z)
- CALR: Corrective Adaptive Low-Rank Decomposition for Efficient Large Language Model Layer Compression [0.0]
Large Language Models (LLMs) present significant deployment challenges due to their immense size and computational requirements. We introduce Corrective Adaptive Low-Rank Decomposition (CALR), a two-component compression approach. We show that CALR can reduce parameter counts by 26.93% to 51.77% while retaining 59.45% to 90.42% of the original model's performance.
arXiv Detail & Related papers (2025-08-21T13:16:02Z)
- Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Stateful optimizers (e.g., Adam) maintain auxiliary information that can reach 2x the model size in order to achieve optimal convergence. SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits. SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z)
- Optimization of embeddings storage for RAG systems using quantization and dimensionality reduction techniques [0.0]
We show that float8 quantization achieves a 4x storage reduction with minimal performance degradation. PCA emerges as the most effective dimensionality reduction technique. We propose a methodology based on visualizing the performance-storage trade-off space to identify the optimal configuration.
arXiv Detail & Related papers (2025-04-30T18:20:16Z)
- Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrary small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
arXiv Detail & Related papers (2024-10-22T17:59:30Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- Make RepVGG Greater Again: A Quantization-aware Approach [22.36179771869403]
We propose a simple, robust, and effective remedy to have a quantization-friendly structure.
Without bells and whistles, the top-1 accuracy drop on ImageNet is reduced within 2% by standard post-training quantization.
Our method also achieves similar FP32 performance as RepVGG.
arXiv Detail & Related papers (2022-12-03T11:14:10Z)
- 4-bit Conformer with Native Quantization Aware Training for Speech Recognition [13.997832593421577]
We propose to develop 4-bit ASR models with native quantization aware training, which leverages native integer operations to effectively optimize both training and inference.
We conducted two experiments on state-of-the-art Conformer-based ASR models to evaluate our proposed quantization technique.
For the first time, we investigated and revealed the viability of 4-bit quantization on a practical ASR system trained with large-scale datasets.
arXiv Detail & Related papers (2022-03-29T23:57:15Z) - Optimizing Vessel Trajectory Compression [71.42030830910227]
In previous work we introduced a trajectory detection module that can provide summarized representations of vessel trajectories by consuming AIS positional messages online.
This methodology can provide reliable trajectory synopses with little deviations from the original course by discarding at least 70% of the raw data as redundant.
However, such trajectory compression is very sensitive to parametrization.
We take into account the type of each vessel in order to provide a suitable configuration that can yield improved trajectory synopses.
arXiv Detail & Related papers (2020-05-11T20:38:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.