Optimization of embeddings storage for RAG systems using quantization and dimensionality reduction techniques
- URL: http://arxiv.org/abs/2505.00105v1
- Date: Wed, 30 Apr 2025 18:20:16 GMT
- Title: Optimization of embeddings storage for RAG systems using quantization and dimensionality reduction techniques
- Authors: Naamán Huerga-Pérez, Rubén Álvarez, Rubén Ferrero-Guillén, Alberto Martínez-Gutiérrez, Javier Díez-González,
- Abstract summary: We show that float8 quantization achieves a 4x storage reduction with minimal performance degradation.<n> PCA emerges as the most effective dimensionality reduction technique.<n>We propose a methodology based on visualizing the performance-storage trade-off space to identify the optimal configuration.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Retrieval-Augmented Generation enhances language models by retrieving relevant information from external knowledge bases, relying on high-dimensional vector embeddings typically stored in float32 precision. However, storing these embeddings at scale presents significant memory challenges. To address this issue, we systematically investigate on MTEB benchmark two complementary optimization strategies: quantization, evaluating standard formats (float16, int8, binary) and low-bit floating-point types (float8), and dimensionality reduction, assessing methods like PCA, Kernel PCA, UMAP, Random Projections and Autoencoders. Our results show that float8 quantization achieves a 4x storage reduction with minimal performance degradation (<0.3%), significantly outperforming int8 quantization at the same compression level, being simpler to implement. PCA emerges as the most effective dimensionality reduction technique. Crucially, combining moderate PCA (e.g., retaining 50% dimensions) with float8 quantization offers an excellent trade-off, achieving 8x total compression with less performance impact than using int8 alone (which provides only 4x compression). To facilitate practical application, we propose a methodology based on visualizing the performance-storage trade-off space to identify the optimal configuration that maximizes performance within their specific memory constraints.
Related papers
- Post-Training Quantization of OpenPangu Models for Efficient Deployment on Atlas A2 [3.309291427648113]
Huawei's openPangu-Embedded-1B and openPangu-Embedded-7B integrate three distinct Chain-of-Thought (CoT) reasoning paradigms.<n>These CoT modes introduce substantial memory and latency overheads, posing challenges for practical deployment on Ascend NPUs.<n>This paper addresses these computational constraints by leveraging low-bit quantization, which transforms FP16 computations into more efficient integer arithmetic.
arXiv Detail & Related papers (2025-12-29T10:50:23Z) - Dimension vs. Precision: A Comparative Analysis of Autoencoders and Quantization for Efficient Vector Retrieval on BEIR SciFact [0.0]
Int8 quantization provides the most effective "sweet spot," achieving a 4x compression with a negligible [1-2%] drop in nDCG@10.<n>Autoencoders show a graceful degradation but suffer a more significant performance loss at equivalent 4x compression ratios.<n> binary quantization was found to be unsuitable for this task due to catastrophic performance drops.
arXiv Detail & Related papers (2025-11-17T07:02:11Z) - Binary Quantization For LLMs Through Dynamic Grouping [13.578307208515819]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of Natural Language Processing (NLP) tasks.<n> Binary quantization, which compresses model weights from 16-bit Brain Float to 1-bit representations in -1, 1, offers significant reductions in storage and inference costs.<n>We propose a novel optimization objective tailored for binary quantization, along with three algorithms designed to realize it effectively.
arXiv Detail & Related papers (2025-09-03T06:36:21Z) - Lossless Compression of Neural Network Components: Weights, Checkpoints, and K/V Caches in Low-Precision Formats [0.0]
In this work, we extend the ZipNN approach to lower-precision floating-point formats, specifically FP8 and FP4.<n>Our evaluation shows compression ratios up to 62% for BF16 and 83% for FP8.<n>We also investigate the compressibility of key-value (K/V) cache tensors used in large language models.
arXiv Detail & Related papers (2025-08-20T12:46:50Z) - Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference [3.7687375904925484]
We propose a novel hardware-efficient quantization and inference scheme that exploits hardware advantages with minimal accuracy degradation.<n>We develop a novel quantization algorithm, dubbed Dual Precision Quantization (DPQ), that leverages the unique structure of our scheme without introducing additional inference overhead.
arXiv Detail & Related papers (2025-05-20T17:26:12Z) - Efficient Fine-Tuning of Quantized Models via Adaptive Rank and Bitwidth [10.872650037112255]
QLoRA effectively combines low-bit quantization and LoRA to achieve memory-friendly fine-tuning for large language models (LLM)<n>We propose textbfQR-Adaptor, a unified, gradient-free strategy that uses partial calibration data to jointly search the quantization components and the rank of low-rank spaces for each layer.<n>Our approach achieves a 4.89% accuracy improvement on GSM8K, and in some cases even outperforms the 16-bit fine-tuned model while maintaining the memory footprint of the 4-bit setting.
arXiv Detail & Related papers (2025-05-02T08:46:01Z) - Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [65.37942405146232]
We present a novel type of overload that carries with extremely lightweight state elements, achieved through ultra-low-precision quantization.<n>The proposed SOLO achieves substantial memory savings (approximately 45 GB when training a 7B model) with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z) - Q-PETR: Quant-aware Position Embedding Transformation for Multi-View 3D Object Detection [9.961425621432474]
We propose Q-PETR, a quantization-aware position embedding transformation that re-engineers key components of the PETR framework.
Q-PETR maintains floating-point performance with a performance degradation of less than 1% under standard 8-bit per-tensor post-training quantization.
Compared to its FP32 counterpart, Q-PETR achieves a two-fold speedup and reduces memory usage by three times.
arXiv Detail & Related papers (2025-02-21T14:26:23Z) - "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear.<n>We conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks.
arXiv Detail & Related papers (2024-11-04T18:21:59Z) - Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrary small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
arXiv Detail & Related papers (2024-10-22T17:59:30Z) - MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization [16.83403134551842]
Recent few-step diffusion models reduce the inference time by reducing the denoising steps.
The Post Training Quantization (PTQ) replaces high bit-width FP representation with low-bit integer values.
However, when applying to few-step diffusion models, existing quantization methods face challenges in preserving both the image quality and text alignment.
arXiv Detail & Related papers (2024-05-28T06:50:58Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures
using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - Quantized Neural Networks for Low-Precision Accumulation with Guaranteed
Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z) - Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed
FP16-INT8 Post-Training Quantization [0.0]
Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) are deployed on a state-of-the-art MicroController Unit (MCU)
We propose an optimized software pipeline interleaving parallel computation of LSTM or GRU recurrent blocks with manually-managed memory transfers.
Experiments are conducted on multiple LSTM and GRU based SE models trained on the Valentini dataset, featuring up to 1.24M parameters.
arXiv Detail & Related papers (2022-10-14T10:32:05Z) - 4-bit Conformer with Native Quantization Aware Training for Speech
Recognition [13.997832593421577]
We propose to develop 4-bit ASR models with native quantization aware training, which leverages native integer operations to effectively optimize both training and inference.
We conducted two experiments on state-of-the-art Conformer-based ASR models to evaluate our proposed quantization technique.
For the first time investigated and revealed the viability of 4-bit quantization on a practical ASR system that is trained with large-scale datasets.
arXiv Detail & Related papers (2022-03-29T23:57:15Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Statefuls maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop first gradients that use 8-bit statistics while maintaining the performance levels of using 32-bit gradient states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.