Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings
- URL: http://arxiv.org/abs/2506.04997v1
- Date: Thu, 05 Jun 2025 13:06:01 GMT
- Title: Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings
- Authors: Yubo Ma, Jinsong Li, Yuhang Zang, Xiaobao Wu, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Jiaqi Wang, Yixin Cao, Aixin Sun
- Abstract summary: ColPali/ColQwen2 encodes each page into multiple patch-level embeddings and leads to excessive memory usage. This empirical study investigates methods to reduce patch embeddings per page at minimum performance degradation.
- Score: 70.26204343623215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the strong performance of ColPali/ColQwen2 in Visualized Document Retrieval (VDR), it encodes each page into multiple patch-level embeddings and leads to excessive memory usage. This empirical study investigates methods to reduce patch embeddings per page at minimum performance degradation. We evaluate two token-reduction strategies: token pruning and token merging. Regarding token pruning, we surprisingly observe that a simple random strategy outperforms other sophisticated pruning methods, though still far from satisfactory. Further analysis reveals that pruning is inherently unsuitable for VDR as it requires removing certain page embeddings without query-specific information. Turning to token merging (more suitable for VDR), we search for the optimal combinations of merging strategy across three dimensions and develop Light-ColPali/ColQwen2. It maintains 98.2% of retrieval performance with only 11.8% of original memory usage, and preserves 94.6% effectiveness at 2.8% memory footprint. We expect our empirical findings and resulting Light-ColPali/ColQwen2 offer valuable insights and establish a competitive baseline for future research towards efficient VDR.
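The storage trade-off described in the abstract can be illustrated with a minimal sketch. It is not the paper's Light-ColPali/ColQwen2 procedure: the merging rule here (averaging consecutive groups of patch embeddings) is an assumed stand-in, and the names `merge_patches`, `maxsim_score`, and `factor` are hypothetical. The scoring function is the usual late-interaction MaxSim, where each query token is matched to its most similar page embedding and the maxima are summed.

```python
import numpy as np


def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """Late-interaction score: best page match per query token, summed."""
    sim = query_emb @ page_emb.T  # shape (n_query_tokens, n_page_embeddings)
    return float(sim.max(axis=1).sum())


def merge_patches(page_emb: np.ndarray, factor: int) -> np.ndarray:
    """Average consecutive groups of `factor` embeddings (assumed merging rule)."""
    n, d = page_emb.shape
    pad = (-n) % factor
    if pad:
        # Repeat the last embedding so the count divides evenly (arbitrary choice).
        page_emb = np.vstack([page_emb, np.repeat(page_emb[-1:], pad, axis=0)])
    merged = page_emb.reshape(-1, factor, d).mean(axis=1)
    # Re-normalize so dot products stay comparable to cosine similarities.
    norms = np.linalg.norm(merged, axis=1, keepdims=True)
    return merged / np.clip(norms, 1e-12, None)


# Toy data: 1024 patch embeddings of dimension 128 (sizes are illustrative only).
rng = np.random.default_rng(0)
page = rng.normal(size=(1024, 128))
page /= np.linalg.norm(page, axis=1, keepdims=True)
query = rng.normal(size=(16, 128))
query /= np.linalg.norm(query, axis=1, keepdims=True)

merged = merge_patches(page, factor=9)
print("stored embeddings:", page.shape[0], "->", merged.shape[0])
print("memory ratio:", merged.shape[0] / page.shape[0])
print("score before/after:", maxsim_score(query, page), maxsim_score(query, merged))
```

Unlike pruning, merging keeps a contribution from every patch in the stored vectors, which fits the abstract's point that embeddings must be reduced without knowing the query in advance.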
Related papers
- VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization [49.5501769221435]
Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information.
Previous research aimed at reducing visual tokens during inference typically leverages importance maps derived from attention scores among vision-only tokens or vision-language tokens to prune tokens across one or multiple pruning stages.
We propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism.
Experiments demonstrate that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, leading to an 89% reduction in KV-Cache memory and 3.8
arXiv Detail & Related papers (2025-08-07T09:47:21Z)
- Hierarchical Patch Compression for ColPali: Efficient Multi-Vector Document Retrieval with Dynamic Pruning and Quantization [0.0]
Multi-vector document retrieval systems, such as ColPali, excel in fine-grained matching for complex queries but incur significant storage and computational costs.
We propose HPC-ColPali, a Hierarchical Patch Compression framework that enhances the efficiency of ColPali while preserving its retrieval accuracy.
Our approach integrates three innovative techniques: (1) K-Means quantization, which compresses patch embeddings into 1-byte centroid indices, achieving up to 32$\times$ storage reduction; (2) attention-guided dynamic pruning, utilizing Vision-Language Model attention weights to retain only the top-$p\%$ most
arXiv Detail & Related papers (2025-06-19T08:45:52Z)
- RePCS: Diagnosing Data Memorization in LLM-Powered Retrieval-Augmented Generation [0.0]
Models may still rely on memorized training data, bypass the retrieved evidence, and produce contaminated outputs.
We introduce Retrieval-Path Contamination Scoring (RePCS), a diagnostic method that detects such behavior without requiring model access or retraining.
arXiv Detail & Related papers (2025-06-18T14:48:19Z)
- Towards Lossless Token Pruning in Late-Interaction Retrieval Models [10.983837305643723]
Late interaction neural IR models like ColBERT offer a competitive effectiveness-efficiency trade-off across many benchmarks.
They require a huge memory space to store the contextual representation for all the document tokens.
We propose a principled approach to define how to prune tokens without impacting the score between a document and a query.
arXiv Detail & Related papers (2025-04-17T09:18:58Z)
- Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference [56.71209737306054]
We propose ActQKV, a training-free, Activation-aware approach that dynamically determines probe-Query and leverages it to retrieve the relevant KV pairs for inference.
Experiments on the Long-Bench and $\infty$ Benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
arXiv Detail & Related papers (2025-02-19T08:50:44Z)
- Finding Needles in Emb(a)dding Haystacks: Legal Document Retrieval via Bagging and SVR Ensembles [51.0691253204425]
We introduce a retrieval approach leveraging Support Vector Regression ensembles, bootstrap aggregation (bagging), and embedding spaces on the German dataset for Legal Information Retrieval (GerDaLIR).
We show improved recall over the baselines using our voting ensemble, suggesting promising initial results, without training or fine-tuning any deep learning models.
arXiv Detail & Related papers (2025-01-09T07:21:44Z)
- Static Pruning in Dense Retrieval using Matrix Decomposition [12.899105656025018]
In the era of dense retrieval, document indexing and retrieval is largely based on encoding models that transform text documents into embeddings.
Recent studies have shown that it is possible to reduce embedding size without sacrificing - and in some cases improving - the retrieval effectiveness.
We present a novel static pruning method for reducing the dimensionality of embeddings using Principal Components Analysis.
arXiv Detail & Related papers (2024-12-13T09:09:20Z)
- ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts.
Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
- Semi-Parametric Retrieval via Binary Bag-of-Tokens Index [71.78109794895065]
SemI-parametric Disentangled Retrieval (SiDR) is a bi-encoder retrieval framework that decouples retrieval index from neural parameters.
SiDR supports a non-parametric tokenization index for search, achieving BM25-like indexing complexity with significantly better effectiveness.
arXiv Detail & Related papers (2024-05-03T08:34:13Z)
- Lexically-Accelerated Dense Retrieval [29.327878974130055]
'LADR' (Lexically-Accelerated Dense Retrieval) is a simple-yet-effective approach that improves the efficiency of existing dense retrieval models.
LADR consistently achieves both precision and recall that are on par with an exhaustive search on standard benchmarks.
arXiv Detail & Related papers (2023-07-31T15:44:26Z)
- DIP: Deep Inverse Patchmatch for High-Resolution Optical Flow [7.73554718719193]
We propose a novel Patchmatch-based framework to work on high-resolution optical flow estimation.
It can get high-precision results with lower memory benefiting from propagation and local search of Patchmatch.
Our method ranks first on all the metrics on the popular KITTI2015 benchmark, and ranks second on EPE on the Sintel clean benchmark among published optical flow methods.
arXiv Detail & Related papers (2022-04-01T10:13:59Z)
- Generalized Binary Search Network for Highly-Efficient Multi-View Stereo [10.367295443948487]
Multi-view Stereo (MVS) with known camera parameters is essentially a 1D search problem within a valid depth range.
Recent deep learning-based MVS methods typically densely sample depth hypotheses in the depth range.
We propose a novel method for highly efficient MVS that remarkably decreases the memory footprint.
arXiv Detail & Related papers (2021-12-04T13:57:18Z)
- ROME: Robustifying Memory-Efficient NAS via Topology Disentanglement and Gradient Accumulation [106.04777600352743]
Differentiable architecture search (DARTS) is largely hindered by its substantial memory cost since the entire supernet resides in the memory.
The single-path DARTS comes in, which only chooses a single-path submodel at each step.
While being memory-friendly, it also comes with low computational costs.
We propose a new algorithm called RObustifying Memory-Efficient NAS (ROME) to give a cure.
arXiv Detail & Related papers (2020-11-23T06:34:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.