Related papers: FastLane: Efficient Routed Systems for Late-Interaction Retrieval

URL: http://arxiv.org/abs/2601.06389v2
Date: Tue, 13 Jan 2026 22:42:00 GMT
Title: FastLane: Efficient Routed Systems for Late-Interaction Retrieval
Authors: Ramnath Kumar, Prateek Jain, Cho-Jui Hsieh
Abstract summary: FastLane is a novel retrieval framework that dynamically routes queries to their most informative representations. By bridging late-interaction models with Approximate Nearest Neighbor Search (ANNS), FastLane enables scalable, low-latency retrieval.
Score: 58.060096779432094
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Late-interaction retrieval models like ColBERT achieve superior accuracy by enabling token-level interactions, but their computational cost hinders scalability and integration with Approximate Nearest Neighbor Search (ANNS). We introduce FastLane, a novel retrieval framework that dynamically routes queries to their most informative representations, eliminating redundant token comparisons. FastLane employs a learnable routing mechanism optimized alongside the embedding model, leveraging self-attention and differentiable selection to maximize efficiency. Our approach reduces computational complexity by up to 30x while maintaining competitive retrieval performance. By bridging late-interaction models with ANNS, FastLane enables scalable, low-latency retrieval, making it feasible for large-scale applications such as search engines, recommendation systems, and question-answering platforms. This work opens pathways for multi-lingual, multi-modal, and long-context retrieval, pushing the frontier of efficient and adaptive information retrieval.
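To make the cost argument in the abstract concrete, here is a minimal sketch contrasting ColBERT-style MaxSim scoring with scoring from a single routed query vector. It is an illustration under stated assumptions, not the paper's implementation: the names `maxsim_score`, `route_query`, `routed_score`, and `router_weights` are hypothetical, and the router here is a fixed linear scorer with a hard argmax, whereas the abstract describes a learned router based on self-attention and differentiable selection.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction (MaxSim) score: for each query token, take the best
    matching document token, then sum over query tokens.
    Costs len(query_tokens) * len(doc_tokens) dot products."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def route_query(query_tokens, router_weights):
    """Hypothetical hard router (an assumption, not the paper's mechanism):
    score each query token with a linear layer and keep the top one."""
    logits = [dot(q, router_weights) for q in query_tokens]
    best = max(range(len(logits)), key=logits.__getitem__)
    return query_tokens[best]


def routed_score(query_tokens, doc_tokens, router_weights):
    """Score a document using only the routed query token.
    Costs len(doc_tokens) dot products after routing."""
    q = route_query(query_tokens, router_weights)
    return max(dot(q, d) for d in doc_tokens)


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, chosen only for illustration.
    query_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    doc_tokens = [[1.0, 0.0], [0.0, 2.0]]
    router_weights = [0.0, 1.0]

    print("MaxSim score:", maxsim_score(query_tokens, doc_tokens))
    print("Routed score:", routed_score(query_tokens, doc_tokens, router_weights))
    print("Dot products, full vs routed:",
          len(query_tokens) * len(doc_tokens), "vs", len(doc_tokens))
```

Because the routed path reduces the query to one vector, that vector could in principle be handed to a standard single-vector ANNS index, which is the bridge the abstract refers to. How the selection is trained and how documents are represented are details this sketch does not attempt to reproduce.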

Related papers

LLMRank: Understanding LLM Strengths for Model Routing [2.166956880697874]
We introduce LLMRank, a prompt-aware routing framework that leverages rich, human-readable features extracted from prompts. Unlike prior one-shot routers that rely solely on latent embeddings, LLMRank predicts per-model utility using a neural ranking model trained on RouterBench. Our approach achieves up to 89.2% of oracle utility, while providing interpretable feature attributions that explain routing decisions.
arXiv Detail & Related papers (2025-09-23T18:11:30Z)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker [0.0]
This paper explores a pragmatic approach to making the vision retrieval process scalable and efficient without compromising performance quality. We propose a multi-step custom implementation that uses widely adopted hybrid search (metadata and embedding) and a state-of-the-art late-interaction re-ranker to retrieve the best-matching pages.
arXiv Detail & Related papers (2025-07-16T16:27:05Z)
Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control [52.405085773954596]
Retrieval-Augmented Generation has emerged as a powerful approach to mitigate large language model hallucinations. Existing RAG frameworks often apply retrieval indiscriminately, leading to inefficiencies such as over-retrieval. We introduce a novel user-controllable RAG framework that enables dynamic adjustment of the accuracy-cost trade-off.
arXiv Detail & Related papers (2025-02-17T18:56:20Z)
Retrieval with Learned Similarities [2.729516456192901]
State-of-the-art retrieval algorithms have migrated to learned similarities. We show that Mixture-of-Logits (MoL) can be realized empirically to achieve superior performance on diverse retrieval scenarios.
arXiv Detail & Related papers (2024-07-22T08:19:34Z)
CART: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
Cross-modal retrieval aims to find instances that are semantically related to the query through the interaction of data from different modalities. Traditional solutions use a single-tower or dual-tower framework to explicitly compute the score between queries and candidates. We propose a generative cross-modal retrieval framework (CART) based on coarse-to-fine semantic modeling.
arXiv Detail & Related papers (2024-06-25T12:47:04Z)
Beyond Two-Tower Matching: Learning Sparse Retrievable Cross-Interactions for Recommendation [80.19762472699814]
Two-tower models are a prevalent matching framework for recommendation and have been widely deployed in industrial applications. They suffer from two main challenges: limited feature interaction capability and reduced accuracy in online serving. We propose a new matching paradigm named SparCode, which supports not only sophisticated feature interactions but also efficient retrieval.
arXiv Detail & Related papers (2023-11-30T03:13:36Z)
RELS-DQN: A Robust and Efficient Local Search Framework for Combinatorial Optimization [11.269582666887324]
We introduce RELS-DQN, a lightweight DQN framework that exhibits local search behavior while providing practical scalability. A RELS-DQN model trained on one application can generalize to various applications, providing solution values higher than or equal to those of both the local search algorithms and the existing DQN models.
arXiv Detail & Related papers (2023-04-11T18:01:49Z)
Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching. While approximate cache hits alleviate the DL inference workload and increase system throughput, they also introduce an approximation error. We analytically model the performance of our caching system for classic LRU and ideal caches, perform a trace-driven evaluation of the expected performance, and compare the benefits of our proposed approach with state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval [80.35589927511667]
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image. We propose a novel fine-tuning framework that turns any pretrained text-image multi-modal model into an efficient retrieval model. Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups demonstrate improved accuracy and huge efficiency benefits over state-of-the-art cross-encoders.
arXiv Detail & Related papers (2021-03-22T15:08:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.