Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling
- URL: http://arxiv.org/abs/2409.14683v1
- Date: Mon, 23 Sep 2024 03:12:43 GMT
- Title: Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling
- Authors: Benjamin Clavié, Antoine Chaffin, Griffin Adams
- Abstract summary: Multi-vector retrieval methods, spearheaded by ColBERT, have become an increasingly popular approach to Neural IR.
However, the storage and memory requirements necessary to store the large number of associated vectors remain an important drawback.
We introduce a simple clustering-based token pooling approach to aggressively reduce the number of vectors that need to be stored.
- Score: 5.232135930253723
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Over the last few years, multi-vector retrieval methods, spearheaded by ColBERT, have become an increasingly popular approach to Neural IR. By storing representations at the token level rather than at the document level, these methods have demonstrated very strong retrieval performance, especially in out-of-domain settings. However, the storage and memory requirements necessary to store the large number of associated vectors remain an important drawback, hindering practical adoption. In this paper, we introduce a simple clustering-based token pooling approach to aggressively reduce the number of vectors that need to be stored. This method can reduce the space & memory footprint of ColBERT indexes by 50% with virtually no retrieval performance degradation. This method also allows for further reductions, reducing the vector count by 66% to 75%, with degradation remaining below 5% on the vast majority of datasets. Importantly, this approach requires no architectural change or query-time processing, and can be used as a simple drop-in during indexation with any ColBERT-like model.
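To make the pooling idea concrete, here is a minimal sketch of clustering-based token pooling at indexation time. It assumes hierarchical (Ward) clustering over a document's token vectors with mean-pooling inside each cluster; the function name and `pool_factor` parameter are illustrative, not the paper's actual code.

```python
# Minimal sketch of clustering-based token pooling at indexation time.
# Assumption: hierarchical (Ward) clustering; pool_factor=2 keeps ~50%
# of the vectors, matching the reduction regime described in the abstract.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def pool_doc_embeddings(doc_embeddings: np.ndarray, pool_factor: int = 2) -> np.ndarray:
    """Reduce a document's token vectors by mean-pooling within clusters."""
    num_tokens = doc_embeddings.shape[0]
    num_clusters = max(1, num_tokens // pool_factor)
    if num_tokens <= num_clusters:
        return doc_embeddings
    # Hierarchical clustering over the token vectors.
    tree = linkage(doc_embeddings, method="ward")
    labels = fcluster(tree, t=num_clusters, criterion="maxclust")
    pooled = np.stack([
        doc_embeddings[labels == c].mean(axis=0) for c in np.unique(labels)
    ])
    # Re-normalise so late-interaction (MaxSim) scoring still applies.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)


# Example: 300 token vectors of dim 128 -> ~150 pooled vectors.
doc = np.random.randn(300, 128).astype(np.float32)
print(pool_doc_embeddings(doc, pool_factor=2).shape)
```

Because pooling happens purely at indexation, any ColBERT-like model can use it as a drop-in, consistent with the abstract's "no architectural change nor query-time processing" claim.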
Related papers
- ESPN: Memory-Efficient Multi-Vector Information Retrieval [0.36832029288386137]
Multi-vector models amplify memory and storage requirements for retrieval indices by an order of magnitude.
We introduce Embedding from Storage Pipelined Network (ESPN), where we offload the entire re-ranking embedding tables to SSDs and reduce the memory requirements by 5-16x.
We design a software prefetcher with hit rates exceeding 90%, improving SSD based retrieval up to 6.4x, and demonstrate that we can maintain near memory levels of query latency even for large query batch sizes.
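As a loose illustration of the offloading idea (not ESPN's actual pipeline or its >90%-hit-rate prefetcher), one can keep the embedding table on SSD via a memory map and pull candidate rows in a background thread before scoring:

```python
# Sketch: SSD-resident embedding table + background prefetch before re-ranking.
# All names and shapes are illustrative assumptions, not ESPN's implementation.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

DIM, N_DOCS = 128, 10_000
# Toy on-disk table standing in for the SSD-resident index
# (a real index would be written once and opened with mode="r").
table = np.memmap("embeddings.bin", dtype=np.float32, mode="w+", shape=(N_DOCS, DIM))
pool = ThreadPoolExecutor(max_workers=1)


def prefetch(doc_ids):
    # Touching the rows pulls their pages from SSD into the OS page cache.
    return table[doc_ids].copy()


def rerank(query_vec, doc_ids):
    future = pool.submit(prefetch, doc_ids)   # overlap SSD reads with other work
    embeddings = future.result()              # (len(doc_ids), DIM), now memory-resident
    order = np.argsort(embeddings @ query_vec)[::-1]
    return np.asarray(doc_ids)[order]


print(rerank(np.random.randn(DIM).astype(np.float32), [3, 14, 159, 2653])[:3])
```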
arXiv Detail & Related papers (2023-12-09T00:19:42Z)
- Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs a local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
A second strategy leverages a novel Re-Ranking technique with a lower upper-bound time complexity, which reduces the memory complexity from O(n²) to O(kn) with k ≪ n.
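A hypothetical sketch of where the O(kn) figure comes from: keep only each sample's k nearest neighbours rather than materialising the full O(n²) pairwise distance matrix (the blocked computation below is an illustration, not the paper's re-ranking method):

```python
# Keep O(kn) neighbour ids instead of an O(n^2) distance matrix.
import numpy as np
from scipy.spatial.distance import cdist


def topk_neighbours(features: np.ndarray, k: int, block: int = 1024) -> np.ndarray:
    n = features.shape[0]
    ids = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, block):
        # Only a (block x n) slice of the distance matrix is ever in memory.
        d = cdist(features[start:start + block], features)
        ids[start:start + block] = np.argsort(d, axis=1)[:, 1:k + 1]  # slot 0 is self
    return ids


feats = np.random.randn(5000, 64)
print(topk_neighbours(feats, k=20).shape)  # (5000, 20): O(kn), never O(n^2)
```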
arXiv Detail & Related papers (2023-07-26T16:19:19Z)
- SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval [92.27387459751309]
We provide SPRINT, a unified Python toolkit for evaluating neural sparse retrieval.
We establish strong and reproducible zero-shot sparse retrieval baselines on the well-established BEIR benchmark.
We show that SPLADEv2 produces sparse representations with a majority of tokens outside of the original query and document.
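A toy illustration of how such an expansion claim can be measured, assuming a SPLADE-style sparse representation given as a token-to-weight mapping (the example values are made up):

```python
# Fraction of active tokens in a sparse representation that never appear
# in the original text. `rep` is a hypothetical stand-in for model output.
def expansion_fraction(sparse_rep: dict[str, float], text_tokens: set[str]) -> float:
    active = [t for t, w in sparse_rep.items() if w > 0]
    outside = [t for t in active if t not in text_tokens]
    return len(outside) / max(1, len(active))


rep = {"neural": 1.3, "sparse": 1.1, "retrieval": 0.9, "bm25": 0.4, "index": 0.2}
print(expansion_fraction(rep, {"neural", "sparse", "retrieval"}))  # 0.4
```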
arXiv Detail & Related papers (2023-07-19T22:48:02Z)
- EcoTTA: Memory-Efficient Continual Test-time Adaptation via Self-distilled Regularization [71.70414291057332]
Test-time adaptation (TTA) is primarily conducted on edge devices with limited memory.
Long-term adaptation often leads to catastrophic forgetting and error accumulation.
We present lightweight meta networks that can adapt the frozen original networks to the target domain.
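A minimal sketch of the frozen-backbone-plus-meta-network pattern, with a self-distilled regulariser that keeps adapted features close to the frozen output; the module layout is an assumption for illustration, not EcoTTA's architecture:

```python
# Frozen block + tiny trainable adapter, regularised toward the frozen
# features (self-distillation) to limit forgetting. Illustrative sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MetaBlock(nn.Module):
    def __init__(self, frozen_block: nn.Module, channels: int):
        super().__init__()
        self.frozen = frozen_block
        for p in self.frozen.parameters():
            p.requires_grad_(False)           # original network stays frozen
        self.adapter = nn.Sequential(         # the only trainable part
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels))

    def forward(self, x):
        frozen_out = self.frozen(x)
        adapted = frozen_out + self.adapter(frozen_out)
        # Self-distilled regularisation: stay close to the frozen features.
        reg = F.mse_loss(adapted, frozen_out.detach())
        return adapted, reg


# At test time, only block.adapter would be optimised (e.g. with an
# entropy objective), adding `reg` to the loss to limit drift.
block = MetaBlock(nn.Conv2d(16, 16, 3, padding=1), channels=16)
out, reg = block(torch.randn(2, 16, 8, 8))
print(out.shape, reg.item())
```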
arXiv Detail & Related papers (2023-03-03T13:05:30Z)
- Integral Continual Learning Along the Tangent Vector Field of Tasks [112.02761912526734]
We propose a lightweight continual learning method which incorporates information from specialized datasets incrementally.
It maintains a small fixed-size memory buffer, as low as 0.4% of the source datasets, which is updated by simple resampling.
Our method achieves strong performance across various buffer sizes for different datasets.
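The summary only says the buffer is "updated by simple resampling"; reservoir sampling is one standard way to realise a fixed-size buffer of that kind, sketched here as an assumption rather than the paper's exact scheme:

```python
# Fixed-size memory buffer updated by resampling (reservoir sampling,
# Algorithm R): every example seen so far has equal probability of staying.
import random


class ReservoirBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = []
        self.seen = 0

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item          # replace uniformly at random


buf = ReservoirBuffer(capacity=100)
for x in range(25_000):                        # e.g. 0.4% of a 25k-sample source
    buf.add(x)
print(len(buf.items))                          # 100, regardless of stream length
```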
arXiv Detail & Related papers (2022-11-23T16:49:26Z)
- CITADEL: Conditional Token Interaction via Dynamic Lexical Routing for Efficient and Effective Multi-Vector Retrieval [72.90850213615427]
Multi-vector retrieval methods combine the merits of sparse (e.g. BM25) and dense (e.g. DPR) retrievers.
These methods are orders of magnitude slower and need much more space to store their indices compared to their single-vector counterparts.
We propose conditional token interaction via dynamic lexical routing, namely CITADEL, for efficient and effective multi-vector retrieval.
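A very loose sketch of what dynamic lexical routing can look like: a router scores each token vector against a lexical key space, and only tokens routed to the same key interact at search time. All shapes, names, and the random "router" below are illustrative stand-ins, not CITADEL's design:

```python
# Route token vectors to lexical keys; score only within shared buckets.
import numpy as np

VOCAB_KEYS, DIM, TOPK = 30_000, 128, 1
router = np.random.randn(DIM, VOCAB_KEYS).astype(np.float32)  # stand-in for a trained router


def route(token_vecs: np.ndarray) -> np.ndarray:
    """Top-k lexical key ids for each token vector."""
    return np.argsort(token_vecs @ router, axis=1)[:, -TOPK:]


# Index time: bucket document token vectors under their routed keys.
inverted_index: dict[int, list] = {}
doc_tokens = np.random.randn(40, DIM).astype(np.float32)
for vec, keys in zip(doc_tokens, route(doc_tokens)):
    for k in keys:
        inverted_index.setdefault(int(k), []).append(vec)

# Query time: each query token only touches its own bucket(s).
q = np.random.randn(8, DIM).astype(np.float32)
score = 0.0
for vec, keys in zip(q, route(q)):
    cands = [c for k in keys for c in inverted_index.get(int(k), [])]
    if cands:
        score += max(float(vec @ c) for c in cands)  # MaxSim within the bucket
print(score)
```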
arXiv Detail & Related papers (2022-11-18T18:27:35Z)
- Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions using Enhanced Reduction [10.749746283569847]
ColBERTer is a neural retrieval model using contextualized late interaction (ColBERT) with enhanced reduction.
For its multi-vector component, ColBERTer reduces the number of vectors stored per document by learning unique whole-word representations for the terms in each document.
Results on the MS MARCO and TREC-DL collections show that ColBERTer can reduce the storage footprint by up to 2.5x while maintaining effectiveness.
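A sketch of the whole-word reduction idea: mean-pool subword token vectors that share a whole-word id, so each document stores one vector per unique word rather than per subword token (illustrative, not ColBERTer's code):

```python
# Pool subword token vectors into one vector per whole word.
import numpy as np


def pool_whole_words(token_vecs: np.ndarray, word_ids: list) -> np.ndarray:
    """word_ids[i] is the whole-word index of subword token i (as produced,
    e.g., by a HuggingFace fast tokenizer's word_ids())."""
    ids = np.asarray(word_ids)
    return np.stack([token_vecs[ids == w].mean(axis=0) for w in sorted(set(word_ids))])


vecs = np.random.randn(6, 128).astype(np.float32)
# e.g. "retrieval" -> ["retri", "##eval"]: 6 subwords spanning 4 whole words.
print(pool_whole_words(vecs, [0, 0, 1, 2, 2, 3]).shape)  # (4, 128)
```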
arXiv Detail & Related papers (2022-03-24T14:28:07Z)
- Incremental Learning of Structured Memory via Closed-Loop Transcription [20.255633973040183]
This work proposes a minimal computational model for learning structured memories of multiple object classes in an incremental setting.
Our method is simpler than existing approaches for incremental learning, and more efficient in terms of model size, storage, and computation.
Experimental results show that our method can effectively alleviate catastrophic forgetting, achieving significantly better performance than prior generative-replay approaches.
arXiv Detail & Related papers (2022-02-11T02:20:43Z) - Rethinking Space-Time Networks with Improved Memory Coverage for
Efficient Video Object Segmentation [68.45737688496654]
We establish correspondences directly between frames without re-encoding the mask features for every object.
With the correspondences, every node in the current query frame is inferred by aggregating features from the past in an associative fashion.
We validated that every memory node now has a chance to contribute, and experimentally showed that such diversified voting is beneficial to both memory efficiency and inference accuracy.
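A compact sketch of an associative readout in which every memory node can contribute to the current frame via softmax attention over key similarities; shapes are illustrative rather than the paper's exact design:

```python
# Associative memory readout: soft attention over all memory keys, so
# every memory node gets a chance to vote on the current frame's features.
import numpy as np


def readout(query_keys, memory_keys, memory_values, temperature=1.0):
    """query_keys: (Nq, d); memory_keys: (Nm, d); memory_values: (Nm, dv)."""
    affinity = query_keys @ memory_keys.T / temperature   # (Nq, Nm)
    affinity -= affinity.max(axis=1, keepdims=True)       # numerical stability
    weights = np.exp(affinity)
    weights /= weights.sum(axis=1, keepdims=True)         # every node contributes
    return weights @ memory_values                        # (Nq, dv)


q = np.random.randn(100, 64)
mk, mv = np.random.randn(500, 64), np.random.randn(500, 32)
print(readout(q, mk, mv).shape)  # (100, 32)
```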
arXiv Detail & Related papers (2021-06-09T16:50:57Z)
- CNN with large memory layers [2.368995563245609]
This work is centred around the recently proposed product key memory structure, implemented for a number of computer vision applications.
The memory structure can be regarded as a simple computation primitive suitable to be augmented to nearly all neural network architectures.
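For reference, a sketch of the product key lookup that this memory structure builds on (following Lample et al., 2019): split the query into two halves, score each half against a small sub-key set, and combine, which yields a top-k over |K|² effective keys at O(|K|) cost. All sizes below are illustrative:

```python
# Product key memory lookup: top-k over K*K keys from two O(K) half-scores.
import numpy as np

K, HALF, DV, TOPK = 64, 32, 128, 4          # 64 x 64 = 4096 effective keys
subkeys1 = np.random.randn(K, HALF)
subkeys2 = np.random.randn(K, HALF)
values = np.random.randn(K * K, DV)         # one value slot per (i, j) key pair


def pkm_lookup(query: np.ndarray) -> np.ndarray:
    s1 = query[:HALF] @ subkeys1.T          # half-scores: O(K), not O(K^2)
    s2 = query[HALF:] @ subkeys2.T
    i1 = np.argsort(s1)[-TOPK:]             # top candidates per half
    i2 = np.argsort(s2)[-TOPK:]
    # Combine TOPK x TOPK candidate pairs and keep the best TOPK overall.
    pair_scores = s1[i1][:, None] + s2[i2][None, :]
    flat = np.argsort(pair_scores, axis=None)[-TOPK:]
    rows, cols = np.unravel_index(flat, pair_scores.shape)
    ids = i1[rows] * K + i2[cols]
    w = np.exp(pair_scores[rows, cols]); w /= w.sum()
    return w @ values[ids]                  # sparse weighted read of the memory


print(pkm_lookup(np.random.randn(2 * HALF)).shape)  # (128,)
```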
arXiv Detail & Related papers (2021-01-27T20:58:20Z)