Hierarchical Patch Compression for ColPali: Efficient Multi-Vector Document Retrieval with Dynamic Pruning and Quantization
- URL: http://arxiv.org/abs/2506.21601v2
- Date: Wed, 02 Jul 2025 03:32:17 GMT
- Title: Hierarchical Patch Compression for ColPali: Efficient Multi-Vector Document Retrieval with Dynamic Pruning and Quantization
- Authors: Duong Bach
- Abstract summary: Multi-vector document retrieval systems, such as ColPali, excel in fine-grained matching for complex queries but incur significant storage and computational costs. We propose HPC-ColPali, a Hierarchical Patch Compression framework that enhances the efficiency of ColPali while preserving its retrieval accuracy. Our approach integrates three innovative techniques: (1) K-Means quantization, which compresses patch embeddings into 1-byte centroid indices, achieving up to 32$\times$ storage reduction; (2) attention-guided dynamic pruning, utilizing Vision-Language Model attention weights to retain only the top-$p\%$ most salient patches; and (3) optional binary encoding of centroid indices for Hamming distance-based similarity search.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Multi-vector document retrieval systems, such as ColPali, excel in fine-grained matching for complex queries but incur significant storage and computational costs due to their reliance on high-dimensional patch embeddings and late-interaction scoring. To address these challenges, we propose HPC-ColPali, a Hierarchical Patch Compression framework that enhances the efficiency of ColPali while preserving its retrieval accuracy. Our approach integrates three innovative techniques: (1) K-Means quantization, which compresses patch embeddings into 1-byte centroid indices, achieving up to 32$\times$ storage reduction; (2) attention-guided dynamic pruning, utilizing Vision-Language Model attention weights to retain only the top-$p\%$ most salient patches, reducing late-interaction computation by up to 60\% with less than 2\% nDCG@10 loss; and (3) optional binary encoding of centroid indices into $b$-bit strings ($b=\lceil\log_2 K\rceil$), enabling rapid Hamming distance-based similarity search for resource-constrained environments. Evaluated on the ViDoRe and SEC-Filings datasets, HPC-ColPali achieves 30--50\% lower query latency under HNSW indexing while maintaining high retrieval precision. When integrated into a Retrieval-Augmented Generation pipeline for legal summarization, it reduces hallucination rates by 30\% and halves end-to-end latency. These advancements establish HPC-ColPali as a scalable and efficient solution for multi-vector document retrieval across diverse applications. Code is available at https://github.com/DngBack/HPC-ColPali.
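For concreteness, the three techniques from the abstract can be sketched in a few functions. The following is a minimal, hypothetical illustration using numpy and scikit-learn; all function names are assumptions, the attention weights are assumed to come from the VLM, and this is not the authors' implementation (see the linked repository for that):

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_patches(patch_embs, k=256):
    """(1) K-Means quantization: map each patch embedding to a centroid index.
    With k <= 256, each high-dimensional float32 vector collapses to one uint8."""
    km = KMeans(n_clusters=k, n_init="auto", random_state=0).fit(patch_embs)
    codes = km.predict(patch_embs).astype(np.uint8)
    return codes, km.cluster_centers_

def prune_by_attention(codes, attn_weights, p=0.4):
    """(2) Attention-guided dynamic pruning: keep the top-p% most salient patches."""
    n_keep = max(1, int(len(codes) * p))
    keep = np.argsort(attn_weights)[::-1][:n_keep]
    return codes[keep]

def to_binary(codes, k=256):
    """(3) Optional binary encoding: b = ceil(log2 K) bits per centroid index."""
    b = int(np.ceil(np.log2(k)))
    return ((codes[:, None] >> np.arange(b)) & 1).astype(np.uint8)

def hamming_score(query_bits, doc_bits):
    """Hamming-distance late interaction: for each query patch, find the closest
    document patch, then sum the negated distances, mirroring MaxSim."""
    dists = (query_bits[:, None, :] != doc_bits[None, :, :]).sum(-1)
    return -dists.min(axis=1).sum()
```

A query would be scored by encoding its patches with the same codebook and comparing bit strings, trading a small amount of precision for cheap fixed-width integer operations.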
Related papers
- TurboReg: TurboClique for Robust and Efficient Point Cloud Registration [13.793023246079418]
TurboReg is built upon a novel lightweight clique, TurboClique, and a highly parallelizable Pivot-Guided Search (PGS) algorithm. Experiments show that TurboReg achieves state-of-the-art performance across multiple real-world datasets.
arXiv Detail & Related papers (2025-07-02T07:50:24Z)
- Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings [70.26204343623215]
ColPali/ColQwen2 encodes each page into multiple patch-level embeddings, which leads to excessive memory usage. This empirical study investigates methods to reduce the number of patch embeddings per page with minimal performance degradation.
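One plausible reduction strategy in this vein is to pool similar patches; a minimal sketch (hypothetical, not necessarily one of the exact methods the study evaluates):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def pool_patch_embeddings(patch_embs, keep_ratio=0.25):
    """Reduce per-page patch embeddings by clustering similar patches and
    mean-pooling each cluster (an illustrative reduction strategy)."""
    n_out = max(1, int(len(patch_embs) * keep_ratio))
    labels = AgglomerativeClustering(n_clusters=n_out).fit_predict(patch_embs)
    pooled = np.stack([patch_embs[labels == c].mean(0) for c in range(n_out)])
    # Re-normalize so late-interaction cosine scores stay comparable.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```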
arXiv Detail & Related papers (2025-06-05T13:06:01Z)
- Learn from the Past: Fast Sparse Indexing for Large Language Model Decoding [7.142158555793151]
As large language models (LLMs) support increasingly long contexts, the memory demand for key-value caches during decoding grows rapidly. Sparse attention mechanisms alleviate this by computing attention weights only for selected key-value pairs, but existing methods often treat each decoding step as an independent process. We propose LFPS, an acceleration method that dynamically constructs sparse indexing candidates based on historical attention patterns.
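A toy sketch of the underlying idea, reusing attention history across steps to propose KV candidates (illustrative only; LFPS's actual index construction is more sophisticated):

```python
import numpy as np

class HistoricalKVSelector:
    """Accumulate an exponential moving average of attention mass per cached
    position and reuse it to propose sparse candidates for the next step."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.score = None  # running salience per KV position

    def update(self, attn_weights):
        # attn_weights: attention distribution over cached positions this step
        if self.score is None or len(self.score) != len(attn_weights):
            prev = np.zeros(0) if self.score is None else self.score
            self.score = np.concatenate(
                [prev, np.zeros(len(attn_weights) - len(prev))])
        self.score = self.decay * self.score + (1 - self.decay) * attn_weights

    def candidates(self, k):
        # Top-k positions by historical attention mass.
        return np.argsort(self.score)[::-1][:k]
```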
arXiv Detail & Related papers (2025-05-30T02:35:59Z)
- Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models. We propose an efficient and hardware-compatible token compression method called Prune and Merge.
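A minimal sketch of the prune-and-merge idea, where pruned tokens are folded into their most similar kept tokens rather than discarded (assumed mechanics, not the paper's exact method):

```python
import numpy as np

def prune_and_merge(tokens, scores, keep_ratio=0.5):
    """Drop low-salience tokens, but merge each pruned token into its most
    similar kept token so its information is preserved rather than lost."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    order = np.argsort(scores)[::-1]
    keep, drop = order[:n_keep], order[n_keep:]
    merged = tokens[keep].copy()
    counts = np.ones(n_keep)
    if len(drop):
        sim = tokens[drop] @ tokens[keep].T   # similarity, pruned vs. kept
        target = sim.argmax(axis=1)           # nearest kept token per pruned one
        for t, d in zip(target, drop):
            merged[t] += tokens[d]
            counts[t] += 1
    return merged / counts[:, None]           # average each merged group
```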
arXiv Detail & Related papers (2025-03-30T14:23:18Z)
- Lightweight Correlation-Aware Table Compression [58.50312417249682]
$\texttt{Virtual}$ is a framework that integrates seamlessly with existing open formats.
Experiments on data-gov datasets show that $\texttt{Virtual}$ reduces file sizes by up to 40% compared to Apache Parquet.
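The core trick behind correlation-aware table compression can be sketched in a few lines: store one column as a function of a correlated column plus residuals. This is a hypothetical illustration of the idea, not $\texttt{Virtual}$'s actual format integration:

```python
import numpy as np

def compress_correlated(col_a, col_b):
    """Store col_b as a linear function of col_a plus residuals; when the
    columns are strongly correlated, the residuals are small and compress well."""
    slope, intercept = np.polyfit(col_a, col_b, deg=1)
    residuals = col_b - (slope * col_a + intercept)
    return (slope, intercept), residuals

def decompress(col_a, params, residuals):
    slope, intercept = params
    return slope * col_a + intercept + residuals
```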
arXiv Detail & Related papers (2024-10-17T22:28:07Z)
- Semi-Parametric Retrieval via Binary Bag-of-Tokens Index [71.78109794895065]
SemI-parametric Disentangled Retrieval (SiDR) is a bi-encoder retrieval framework that decouples the retrieval index from neural parameters. SiDR supports a non-parametric tokenization index for search, achieving BM25-like indexing complexity with significantly better effectiveness.
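A toy version of a non-parametric binary bag-of-tokens index (the whitespace tokenizer and vocabulary dict are assumptions for illustration):

```python
import numpy as np

def bag_of_tokens(text, vocab):
    """Binary {0,1} vector over the vocabulary, buildable without any neural
    forward pass, which keeps indexing cost BM25-like."""
    vec = np.zeros(len(vocab), dtype=np.uint8)
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec

def score(query_vec, doc_vecs):
    # Token overlap between the query and each indexed document.
    return doc_vecs @ query_vec.astype(np.int32)
```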
arXiv Detail & Related papers (2024-05-03T08:34:13Z)
- LeCo: Lightweight Compression via Learning Serial Correlations [9.108815508920882]
Lightweight data compression is a key technique that allows column stores to exhibit superior performance for analytical queries.
We propose LeCo (i.e., Learned Compression), a framework that uses machine learning to remove the serial redundancy in a value sequence automatically.
We observe up to a 5.2x speedup on data analytical queries in the Arrow columnar execution engine and a 16% increase in RocksDB's throughput.
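In the spirit of LeCo, a sketch of learned serial compression: fit a simple model over positions and store only residuals. Here the model is a line and the residuals are lossless integers; LeCo itself supports richer learned models:

```python
import numpy as np

def leco_style_compress(values):
    """Fit a line over positions and keep only integer residuals; for a
    near-linear sequence the residuals are tiny and compress very well."""
    idx = np.arange(len(values))
    slope, intercept = np.polyfit(idx, values, deg=1)
    pred = np.round(slope * idx + intercept).astype(np.int64)
    return (slope, intercept), values - pred   # lossless residuals

def leco_style_decompress(params, residuals):
    slope, intercept = params
    idx = np.arange(len(residuals))
    pred = np.round(slope * idx + intercept).astype(np.int64)
    return pred + residuals
```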
arXiv Detail & Related papers (2023-06-27T10:46:36Z)
- Hybrid Inverted Index Is a Robust Accelerator for Dense Retrieval [25.402767809863946]
The inverted file structure is a common technique for accelerating dense retrieval.
In this work, we present the Hybrid Inverted Index (HI$^2$), where embedding clusters and salient terms work together to accelerate dense retrieval.
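A toy hybrid index in this spirit, posting each document both under its nearest embedding cluster and under its salient terms (illustrative only; HI$^2$'s actual construction and term selection differ):

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

class HybridInvertedIndex:
    """Union candidates from embedding-cluster posting lists and
    salient-term posting lists, then re-rank with exact scores."""

    def __init__(self, doc_embs, doc_terms, n_clusters=16):
        self.km = KMeans(n_clusters=n_clusters, n_init="auto",
                         random_state=0).fit(doc_embs)
        self.cluster_posting = defaultdict(list)
        self.term_posting = defaultdict(list)
        for i, c in enumerate(self.km.labels_):
            self.cluster_posting[c].append(i)
        for i, terms in enumerate(doc_terms):
            for t in terms:
                self.term_posting[t].append(i)

    def candidates(self, query_emb, query_terms, n_probe=2):
        dists = np.linalg.norm(self.km.cluster_centers_ - query_emb, axis=1)
        cand = set()
        for c in np.argsort(dists)[:n_probe]:   # nearest embedding clusters
            cand.update(self.cluster_posting[c])
        for t in query_terms:                    # salient-term posting lists
            cand.update(self.term_posting.get(t, []))
        return cand
```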
arXiv Detail & Related papers (2022-10-11T15:12:41Z)
- HyP$^2$ Loss: Beyond Hypersphere Metric Space for Multi-label Image Retrieval [20.53316810731414]
We propose a novel metric learning framework with Hybrid Proxy-Pair Loss (HyP$^2$ Loss).
The proposed HyP$^2$ Loss focuses on optimizing the hypersphere space by learnable proxies and excavating data-to-data correlations of irrelevant pairs.
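A generic proxy-plus-pair loss sketch capturing that description (hypothetical; the paper's exact HyP$^2$ formulation differs):

```python
import torch
import torch.nn.functional as F

def hybrid_proxy_pair_loss(emb, labels, proxies, margin=0.5):
    """emb: (N, D) L2-normalized; labels: (N, C) multi-hot; proxies: (C, D).
    Pull embeddings toward the proxies of their positive classes, and push
    apart pairs of samples that share no label (irrelevant pairs)."""
    p = F.normalize(proxies, dim=1)
    sim = emb @ p.T                                  # data-to-proxy similarities
    proxy_loss = F.binary_cross_entropy_with_logits(sim, labels.float())

    pair_sim = emb @ emb.T                           # data-to-data similarities
    irrelevant = (labels.float() @ labels.float().T) == 0
    irrelevant.fill_diagonal_(False)
    pair_loss = (F.relu(pair_sim[irrelevant] - margin).mean()
                 if irrelevant.any() else 0.0)
    return proxy_loss + pair_loss
```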
arXiv Detail & Related papers (2022-08-14T15:06:27Z)
- Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras.
Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are inefficient due to slow Euclidean distance computation.
We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by $0.25\times$.
arXiv Detail & Related papers (2022-07-13T02:44:05Z)
- Injecting Domain Adaptation with Learning-to-hash for Effective and Efficient Zero-shot Dense Retrieval [49.98615945702959]
We evaluate LTH and vector compression techniques for improving the downstream zero-shot retrieval accuracy of the TAS-B dense retriever.
Our results demonstrate that, unlike prior work, LTH strategies, when applied naively, can underperform the zero-shot TAS-B dense retriever on average by up to 14% nDCG@10.
arXiv Detail & Related papers (2022-05-23T17:53:44Z)
- ISTA-NAS: Efficient and Consistent Neural Architecture Search by Sparse Coding [86.40042104698792]
We formulate neural architecture search as a sparse coding problem.
In experiments, our two-stage method on CIFAR-10 requires only 0.05 GPU-day for search.
Our one-stage method produces state-of-the-art performances on both CIFAR-10 and ImageNet at the cost of only evaluation time.
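The sparse-coding solver at the heart of such a formulation is classically ISTA; a textbook implementation is sketched below (generic, not the paper's NAS-specific variant):

```python
import numpy as np

def ista(A, b, lam, n_iter=200, step=None):
    """Plain ISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)
        z = x - step * grad
        x = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft-threshold
    return x
```

Each iteration is a gradient step on the quadratic term followed by soft-thresholding, which is what drives most entries of x to exactly zero and yields the sparse solution.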
arXiv Detail & Related papers (2020-10-13T04:34:24Z)