Col-Bandit: Zero-Shot Query-Time Pruning for Late-Interaction Retrieval
- URL: http://arxiv.org/abs/2602.02827v1
- Date: Mon, 02 Feb 2026 21:27:01 GMT
- Title: Col-Bandit: Zero-Shot Query-Time Pruning for Late-Interaction Retrieval
- Authors: Roi Pony, Adi Raz, Oshri Naparstek, Idan Friedman, Udi Barzelay,
- Abstract summary: Col-Bandit is a query-time pruning algorithm that reduces the computational burden of exhaustive MaxSim scoring by casting reranking as a finite-population Top-$K$ identification problem. Unlike coarse-grained approaches that prune entire documents or tokens offline, Col-Bandit sparsifies the interaction matrix on the fly. Experiments show that Col-Bandit preserves ranking fidelity while reducing MaxSim FLOPs by up to 5$\times$.
- Score: 2.159285655678094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-vector late-interaction retrievers such as ColBERT achieve state-of-the-art retrieval quality, but their query-time cost is dominated by exhaustively computing token-level MaxSim interactions for every candidate document. While approximating late interaction with single-vector representations reduces cost, it often incurs substantial accuracy loss. We introduce Col-Bandit, a query-time pruning algorithm that reduces this computational burden by casting reranking as a finite-population Top-$K$ identification problem. Col-Bandit maintains uncertainty-aware bounds over partially observed document scores and adaptively reveals only the (document, query token) MaxSim entries needed to determine the top results under statistical decision bounds with a tunable relaxation. Unlike coarse-grained approaches that prune entire documents or tokens offline, Col-Bandit sparsifies the interaction matrix on the fly. It operates as a zero-shot, drop-in layer over standard multi-vector systems, requiring no index modifications, offline preprocessing, or model retraining. Experiments on textual (BEIR) and multimodal (REAL-MM-RAG) benchmarks show that Col-Bandit preserves ranking fidelity while reducing MaxSim FLOPs by up to 5$\times$, indicating that dense late-interaction scoring contains substantial redundancy that can be identified and pruned efficiently at query time.
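The core idea, revealing only the (document, query token) entries needed to resolve the top-$K$, can be illustrated with a simplified, deterministic sketch. The paper's actual algorithm uses statistical decision bounds with a tunable relaxation; the sketch below only assumes that score entries lie in $[0, 1]$ (as for cosine similarities of normalized embeddings) and prunes any document whose optimistic upper bound falls below the pessimistic score of the current $k$-th best candidate. The function name and cost accounting are illustrative, not from the paper:

```python
import numpy as np

def adaptive_topk(S, k):
    """Bound-based top-k selection over a score matrix S (docs x query tokens)
    with entries in [0, 1]. Columns are revealed one at a time; a document is
    pruned once even its most optimistic completion cannot reach the current
    k-th best pessimistic score. A simplified stand-in for Col-Bandit's
    statistical decision bounds, not the paper's algorithm."""
    n_docs, n_tok = S.shape
    seen = np.zeros(n_docs)            # partial sums of revealed entries
    active = np.arange(n_docs)         # documents still in contention
    revealed = 0                       # number of entries actually computed
    for t in range(n_tok):
        seen[active] += S[active, t]   # reveal one query-token column
        revealed += len(active)
        remaining = n_tok - (t + 1)
        lower = seen[active]           # unseen entries are at least 0
        upper = seen[active] + remaining  # unseen entries are at most 1
        kth_lower = np.sort(lower)[-k]    # k-th best pessimistic score
        active = active[upper >= kth_lower]
        if len(active) == k:
            break
    # exact ranking among the few survivors (not counted toward `revealed`)
    full = S[active].sum(axis=1)
    topk = set(active[np.argsort(full)[::-1][:k]])
    return topk, revealed

rng = np.random.default_rng(0)
S = rng.random((100, 32))              # 100 candidate docs, 32 query tokens
approx, cost = adaptive_topk(S, k=5)
exact = set(np.argsort(S.sum(axis=1))[::-1][:5])
print(approx == exact, cost, 100 * 32)
```

Because pruning only discards documents that provably cannot enter the top-$k$, the result matches exhaustive scoring exactly; the savings come from never computing the remaining entries of documents eliminated early.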
Related papers
- Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation [49.48204107529758]
We define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query. In this paper, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
arXiv Detail & Related papers (2026-02-12T18:15:08Z)
- Pruning Minimal Reasoning Graphs for Efficient Retrieval-Augmented Generation [7.922207753581737]
We present AutoPrunedRetriever, a graph-style RAG system that persists the minimal reasoning subgraph built for earlier questions and incrementally extends it for later ones. AutoPrunedRetriever stores entities and relations in a compact, ID-indexed codebook and represents questions, facts, and answers as edge sequences. On our harder and TV benchmarks, AutoPrunedRetriever again ranks first, while using up to two orders of magnitude fewer tokens than graph-heavy baselines.
arXiv Detail & Related papers (2026-02-04T08:48:11Z)
- Quantifying Memory Use in Reinforcement Learning with Temporal Range [51.98491034847041]
Temporal Range is a model-agnostic metric that treats first-order sensitivities of multiple vector outputs across a temporal window to the input sequence as a temporal influence profile. We also report Temporal Range for a compact Long Expressive Memory (LEM) policy trained on the task, using it as a proxy readout of task-level memory.
arXiv Detail & Related papers (2025-12-05T22:58:09Z)
- Right Answer at the Right Time - Temporal Retrieval-Augmented Generation via Graph Summarization [23.1799368651364]
Question answering in temporal knowledge graphs requires retrieval that is both time-consistent and efficient. We propose STAR-RAG, a temporal GraphRAG framework that relies on two key ideas: building a time-aligned rule graph and conducting propagation on this graph to narrow the search space. This design enforces temporal proximity during retrieval, reduces the candidate set of retrieval results, and lowers token consumption without sacrificing accuracy.
arXiv Detail & Related papers (2025-10-19T05:00:04Z)
- DQS: A Low-Budget Query Strategy for Enhancing Unsupervised Data-driven Anomaly Detection Approaches [3.3482093430607267]
This work integrates active learning with an existing unsupervised anomaly detection method. We introduce a novel query strategy called the dissimilarity-based query strategy (DQS).
arXiv Detail & Related papers (2025-09-06T09:50:10Z)
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating the constraint on every token can be prohibitively expensive. LCD can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
- Gumbel Reranking: Differentiable End-to-End Reranker Optimization [61.16471123356738]
RAG systems rely on rerankers to identify relevant documents. Fine-tuning these models remains challenging due to the scarcity of annotated query-document pairs. We propose Gumbel Reranking, an end-to-end training framework for rerankers aimed at minimizing the training-inference gap.
arXiv Detail & Related papers (2025-02-16T13:23:39Z)
- A Reproducibility Study of PLAID [25.86500025007641]
We compare PLAID with an important baseline missing from the paper: re-ranking a lexical system.
We find that applying ColBERTv2 as a re-ranker atop an initial pool of BM25 results provides better efficiency-effectiveness trade-offs in low-latency settings.
We find that recently proposed modifications to re-ranking that pull in the neighbors of top-scoring documents overcome this limitation.
arXiv Detail & Related papers (2024-04-23T12:46:53Z)
- ReAct: Temporal Action Detection with Relational Queries [84.76646044604055]
This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries.
We first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations.
Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries.
arXiv Detail & Related papers (2022-07-14T17:46:37Z)
- Optimal Clustering with Bandit Feedback [57.672609011609886]
This paper considers the problem of online clustering with bandit feedback.
It includes a novel stopping rule for sequential testing that circumvents the need to solve any NP-hard weighted clustering problem as its subroutines.
We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower bound asymptotically, and significantly outperforms a non-adaptive baseline algorithm.
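The bandit stopping-rule idea, identifying the best option from noisy samples without exhausting the sampling budget, can be sketched with a generic successive-elimination procedure using Hoeffding confidence bounds. This is an illustrative toy, not BOC's clustering-specific rule; the arm means and failure probability `delta` below are made-up parameters:

```python
import math
import random

def best_arm(means, delta=0.05, seed=0):
    """Toy successive elimination with Hoeffding confidence bounds: sample
    each surviving arm once per round and eliminate any arm whose upper
    confidence bound falls below the best lower bound. Illustrates the
    generic stopping-rule idea only."""
    rng = random.Random(seed)
    n = len(means)
    active = set(range(n))
    pulls = [0] * n
    totals = [0.0] * n
    while len(active) > 1:
        for a in active:
            totals[a] += rng.random() < means[a]   # Bernoulli reward
            pulls[a] += 1
        # anytime Hoeffding radius with a crude union bound over arms/rounds
        rad = {a: math.sqrt(math.log(4 * n * pulls[a] ** 2 / delta)
                            / (2 * pulls[a]))
               for a in active}
        best_lower = max(totals[a] / pulls[a] - rad[a] for a in active)
        active = {a for a in active
                  if totals[a] / pulls[a] + rad[a] >= best_lower}
    return active.pop()

print(best_arm([0.1, 0.5, 0.9]))   # identifies arm 2 with high probability
```

The procedure stops as soon as the confidence intervals separate, so clearly suboptimal arms are abandoned after few samples; with probability at least 1 - delta the surviving arm is the true best.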
arXiv Detail & Related papers (2022-02-09T06:05:05Z)
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT [24.288824715337483]
ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.
We extensively evaluate ColBERT using two recent passage search datasets.
arXiv Detail & Related papers (2020-04-27T14:21:03Z)
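ColBERT's late interaction scores a document by taking, for each query token embedding, the maximum similarity over all document token embeddings, and summing these maxima. A minimal sketch, assuming L2-normalized embeddings so that dot products are cosine similarities:

```python
import numpy as np

def maxsim_score(Q, D):
    """Late-interaction (MaxSim) score: for each query token embedding,
    take its maximum similarity over all document token embeddings,
    then sum over query tokens.
    Q: (num_query_tokens, dim), D: (num_doc_tokens, dim), rows normalized."""
    sim = Q @ D.T                  # (q_tokens, d_tokens) cosine similarities
    return sim.max(axis=1).sum()   # best match per query token, summed

# Toy example: 2 query tokens and 3 document tokens in 2-D
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
D = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
print(maxsim_score(Q, D))  # each query token finds an exact match -> 2.0
```

This exhaustive score costs one similarity per (query token, document token) pair for every candidate, which is the per-query cost that Col-Bandit's adaptive pruning targets.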
This list is automatically generated from the titles and abstracts of the papers in this site.