No More K-means: Single-Stage Sparse Coding for Efficient Multi-Vector Retrieval
Abstract Overview
This paper proposes Single-stage Sparse Retrieval (SSR), a multi-vector retrieval framework that replaces clustering-based dense indexing with sparse coding via sparse autoencoders. Instead of compressing token embeddings into low-dimensional dense vectors, SSR maps them into high-dimensional but highly sparse representations, allowing retrieval through neuron-level inverted indexes and sparse late interaction scoring. The method includes token-only and token-plus-[CLS] variants, as well as an accelerated SSR++ pipeline that uses coarse-to-fine pruning to reduce latency. Experiments on MS MARCO, BEIR, LoTTE, long-document ranking, and LLM-based backbones evaluate both retrieval effectiveness and system efficiency.
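To make the retrieval path concrete, the following is a minimal sketch of neuron-level inverted indexing with sparse late interaction scoring. It is an illustration under stated assumptions, not the paper's implementation: the top-k ReLU encoder, the function names (sparse_encode, build_index, score), and the sum-of-max (MaxSim-style) aggregation are all hypothetical choices made for this example.

```python
from collections import defaultdict

import numpy as np


def sparse_encode(x, W, b, k):
    """Assumed encoder: ReLU projection, then keep the k largest activations."""
    z = np.maximum(W @ x + b, 0.0)
    top = np.argsort(z)[-k:]
    # Store only strictly positive activations as {neuron_id: weight}.
    return {int(j): float(z[j]) for j in top if z[j] > 0.0}


def build_index(doc_tokens):
    """Map each neuron to postings of (doc_id, token_position, weight)."""
    index = defaultdict(list)
    for doc_id, codes in doc_tokens.items():
        for t, code in enumerate(codes):
            for neuron, w in code.items():
                index[neuron].append((doc_id, t, w))
    return index


def score(query_codes, index):
    """Sum over query tokens of the best-matching document token (assumed MaxSim)."""
    totals = defaultdict(float)
    for q in query_codes:
        sims = defaultdict(float)  # (doc_id, token_position) -> sparse dot product
        for neuron, qw in q.items():
            for doc_id, t, dw in index.get(neuron, ()):
                sims[(doc_id, t)] += qw * dw
        best = defaultdict(float)  # doc_id -> max similarity for this query token
        for (doc_id, _), s in sims.items():
            best[doc_id] = max(best[doc_id], s)
        for doc_id, s in best.items():
            totals[doc_id] += s
    return dict(totals)


# Toy usage with hand-written sparse codes (neuron ids and weights are made up).
docs = {"d1": [{0: 1.0, 3: 2.0}, {5: 1.0}], "d2": [{3: 1.0}]}
query = [{3: 1.0}, {5: 2.0}]
scores = score(query, build_index(docs))
```

Only postings for neurons that are active in the query are touched, which is why high sparsity keeps the candidate set small without a separate clustering or centroid-lookup stage.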
Novelty
The main novelty is a shift in multi-vector retrieval from clustering-based dense approximation (K-means) to single-stage sparse coding with inverted indexing. The paper also combines sparse autoencoding with retrieval-oriented contrastive objectives so that the sparse representations remain both reconstructive and discriminative for ranking.
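The idea of pairing a reconstruction term with a contrastive ranking term can be sketched as below. This summary does not state the paper's exact objective, so the mean-squared-error reconstruction term, the InfoNCE-style contrastive term, and the weighting parameter lam are assumptions chosen only to illustrate the combination.

```python
import numpy as np


def reconstruction_loss(x, x_hat):
    """Mean squared error between an embedding and its autoencoder reconstruction."""
    return float(np.mean((x - x_hat) ** 2))


def contrastive_loss(scores, positive_idx):
    """InfoNCE-style loss: negative log-softmax of the positive candidate's score."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[positive_idx])


def combined_loss(x, x_hat, scores, positive_idx, lam=1.0):
    """Assumed weighted sum; lam is a hypothetical trade-off hyperparameter."""
    return reconstruction_loss(x, x_hat) + lam * contrastive_loss(scores, positive_idx)


# Toy usage: perfect reconstruction, two candidates with equal scores.
x = np.array([1.0, 2.0, 3.0])
loss = combined_loss(x, x.copy(), np.array([0.0, 0.0]), positive_idx=0)
```

In this toy case the reconstruction term is zero and the contrastive term equals log 2, since the positive is indistinguishable from the single negative.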
Results
On the controlled BEIR evaluation, SSR-CLS achieves the best average nDCG@10 of 53.4, exceeding Splade-v3 (51.2) and PLAID (49.3). SSR-tok reaches a retrieval latency of 17.5 ms while still outperforming the compared baselines in average effectiveness. The indexing pipeline is reported to be over 15x faster than that of ColBERTv2. The paper also reports robust results across settings, including 9 of 13 BEIR datasets, LoTTE long-tail retrieval, long-document ranking, and Llama-embed-8B backbones.
Key Points
- SSR replaces K-means-based clustering in multi-vector retrieval with sparse autoencoder projections and neuron-level inverted indexing.
- The method improves the effectiveness-efficiency trade-off, reporting sub-20 ms retrieval latency and more than 15x faster indexing than clustering-based dense multi-vector retrieval systems such as ColBERTv2.
- The empirical study covers standard benchmarks, long-tail and long-document settings, and frozen LLM backbones, suggesting the approach generalizes beyond a narrow controlled setup.