CleANN: Efficient Full Dynamism in Graph-based Approximate Nearest Neighbor Search
- URL: http://arxiv.org/abs/2507.19802v1
- Date: Sat, 26 Jul 2025 05:27:32 GMT
- Title: CleANN: Efficient Full Dynamism in Graph-based Approximate Nearest Neighbor Search
- Authors: Ziyu Zhang, Yuanhao Wei, Joshua Engels, Julian Shun
- Abstract summary: Approximate nearest neighbor search (ANNS) has become a quintessential algorithmic problem for various other foundational data tasks for AI workloads. Most existing graph-based indexes are designed for the static scenario, where there are no updates to the data after the index is constructed. CleANN is the first concurrent ANNS index to achieve such efficiency while maintaining quality under full dynamism.
- Score: 6.319134122855477
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Approximate nearest neighbor search (ANNS) has become a quintessential algorithmic problem for various other foundational data tasks for AI workloads. Graph-based ANNS indexes have superb empirical trade-offs in indexing cost, query efficiency, and query approximation quality. Most existing graph-based indexes are designed for the static scenario, where there are no updates to the data after the index is constructed. However, full dynamism (insertions, deletions, and searches) is crucial to providing up-to-date responses in applications using vector databases. It is desirable that the index efficiently supports updates and search queries concurrently. Existing dynamic graph-based indexes suffer from at least one of the following problems: (1) the query quality degrades as updates happen; and (2) the graph structure updates used to maintain the index quality upon updates are global and thus expensive. To solve these problems, we propose the CleANN system which consists of three main components: (1) workload-aware linking of diverse search tree descendants to combat distribution shift; (2) query-adaptive on-the-fly neighborhood consolidation to efficiently handle deleted nodes; and (3) semi-lazy memory cleaning to clean up stale information in the data structure and reduce the work spent by the first two components. We evaluate CleANN on 7 diverse datasets on fully dynamic workloads and find that CleANN has query quality at least as good as if the index had been built statically using the corresponding data. In the in-memory setting using 56 hyper-threads, with all types of queries running concurrently, at the same recall level, CleANN achieves 7-1200x throughput improvement on million-scale real-world datasets. To the best of our knowledge, CleANN is the first concurrent ANNS index to achieve such efficiency while maintaining quality under full dynamism.
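The abstract describes a fully dynamic interface (concurrent insertions, deletions, and searches over a proximity graph) with deleted nodes handled lazily and cleaned up later. The sketch below is a minimal, single-threaded illustration of that interface and of tombstone-based deletion with deferred consolidation; it is not the CleANN algorithm, and the class name, parameters, and the simple greedy linking/search logic are assumptions made purely for illustration.

```python
# Minimal sketch of a fully dynamic graph-based ANN index (illustrative only;
# NOT the CleANN algorithm). Deletions are tombstoned and cleaned up lazily.
import heapq
import numpy as np


class DynamicGraphIndex:
    def __init__(self, dim, max_degree=16, beam_width=32):
        self.dim = dim
        self.max_degree = max_degree
        self.beam_width = beam_width
        self.vectors = {}      # node id -> vector
        self.neighbors = {}    # node id -> list of neighbor ids
        self.deleted = set()   # tombstoned ids awaiting cleanup
        self.next_id = 0

    def _dist(self, a, b):
        return float(np.linalg.norm(a - b))

    def insert(self, vec):
        nid = self.next_id
        self.next_id += 1
        vec = np.asarray(vec, dtype=np.float32)
        # Link the new node to its closest existing, non-deleted nodes.
        cands = self._search_ids(vec, self.max_degree)
        self.vectors[nid] = vec
        self.neighbors[nid] = cands
        for c in cands:  # add back-edges, prune over-full adjacency lists
            self.neighbors[c].append(nid)
            if len(self.neighbors[c]) > 2 * self.max_degree:
                self.neighbors[c] = sorted(
                    self.neighbors[c],
                    key=lambda x: self._dist(self.vectors[x], self.vectors[c]),
                )[: self.max_degree]
        return nid

    def delete(self, nid):
        # Lazy deletion: mark only; edges are repaired later in consolidate().
        self.deleted.add(nid)

    def _search_ids(self, query, k):
        # Greedy beam search over the graph; tombstoned nodes are still used
        # for routing but are filtered out of the returned results.
        alive = [i for i in self.vectors if i not in self.deleted]
        if not alive:
            return []
        start = alive[0]
        visited = {start}
        frontier = [(self._dist(query, self.vectors[start]), start)]
        best = list(frontier)
        while frontier:
            d, node = heapq.heappop(frontier)
            if len(best) >= self.beam_width and d > best[-1][0]:
                break
            for nb in self.neighbors.get(node, []):
                if nb in visited or nb not in self.vectors:
                    continue
                visited.add(nb)
                dn = self._dist(query, self.vectors[nb])
                heapq.heappush(frontier, (dn, nb))
                best.append((dn, nb))
                best.sort()
                best = best[: self.beam_width]
        return [i for _, i in best if i not in self.deleted][:k]

    def search(self, query, k):
        return self._search_ids(np.asarray(query, dtype=np.float32), k)

    def consolidate(self):
        # Deferred cleanup: reroute edges around tombstoned nodes, then drop them.
        for nid in list(self.deleted):
            for u, nbrs in self.neighbors.items():
                if nid in nbrs:
                    nbrs.remove(nid)
                    nbrs.extend(n for n in self.neighbors[nid]
                                if n != u and n not in self.deleted and n not in nbrs)
                    nbrs[:] = nbrs[: self.max_degree]
            self.vectors.pop(nid, None)
            self.neighbors.pop(nid, None)
        self.deleted.clear()
```

The tombstone-plus-deferred-cleanup pattern above only loosely mirrors the abstract's "semi-lazy memory cleaning" idea; CleANN's actual workload-aware linking and query-adaptive consolidation are more involved and run concurrently across threads.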
Related papers
- Empowering Graph-based Approximate Nearest Neighbor Search with Adaptive Awareness Capabilities [19.352675865525395]
This paper proposes GATE, a high-tier proximity Graph with Adaptive Topology and Query AwarEness. GATE achieves a 1.2-2.0X speed-up in query performance compared to state-of-the-art graph-based indexes.
arXiv Detail & Related papers (2025-06-19T03:07:12Z) - HAKES: Scalable Vector Database for Embedding Search Service [16.034584281180006]
We build a vector database that achieves high throughput and high recall under concurrent read-write workloads. Our index outperforms index baselines in the high recall region and under concurrent read-write workloads. HAKES is scalable and achieves up to $16\times$ higher throughput than the baselines.
arXiv Detail & Related papers (2025-05-18T19:26:29Z) - In-Place Updates of a Graph Index for Streaming Approximate Nearest Neighbor Search [12.092920351505036]
IP-DiskANN is the first algorithm to avoid batch consolidation by efficiently processing each insertion and deletion in-place. It has stable recall over various lengthy update patterns in both high-recall and low-recall regimes.
arXiv Detail & Related papers (2025-02-19T15:41:08Z) - Semi-Parametric Retrieval via Binary Bag-of-Tokens Index [71.78109794895065]
SemI-parametric Disentangled Retrieval (SiDR) is a bi-encoder retrieval framework that decouples the retrieval index from neural parameters. SiDR supports a non-parametric tokenization index for search, achieving BM25-like indexing complexity with significantly better effectiveness.
arXiv Detail & Related papers (2024-05-03T08:34:13Z) - Approximate Nearest Neighbour Search on Dynamic Datasets: An Investigation [20.409659920455955]
Approximate k-Nearest Neighbour (ANN) methods are often used for mining information and aiding machine learning on large-scale high-dimensional datasets. For applications with static datasets, runtime constraints and dataset properties can be used to empirically select an ANN method with suitable operating characteristics. Traditional evaluation approaches do not consider the computational costs of updating the index structure, as well as the rate and size of index updates.
arXiv Detail & Related papers (2024-04-30T06:21:44Z) - LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword Queries [53.843367588870585]
Top-k kNN spatial keyword queries (TkQs) return a list of objects based on a ranking function that considers both spatial and textual relevance.
There are two key challenges in building an effective and efficient index, i.e., the absence of high-quality labels and the unbalanced results.
We develop a novel pseudolabel generation technique to address the two challenges.
arXiv Detail & Related papers (2024-03-12T05:32:33Z) - On Exploring Node-feature and Graph-structure Diversities for Node Drop Graph Pooling [86.65151066870739]
Current node drop pooling methods ignore the graph diversity in terms of the node features and the graph structures, thus resulting in suboptimal graph-level representations.
We propose a novel plug-and-play score scheme, referred to as MID, which consists of a Multidimensional score space with two operations, i.e., flIpscore and Dropscore.
Specifically, the multidimensional score space depicts the significance of nodes through multiple criteria; the flipscore encourages the maintenance of dissimilar node
arXiv Detail & Related papers (2023-06-22T08:02:01Z) - OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries [8.950168559003993]
State-of-the-art algorithms for Approximate Nearest Neighbor Search (ANNS) build data dependent indices that offer substantially better accuracy and search efficiency over data-agnostic indices by overfitting to the index data distribution.
We present OOD-DiskANN, which uses a sparing sample (1% of index set size) of OOD queries, and provides up to 40% improvement in mean query latency over SoTA algorithms of a similar memory footprint.
arXiv Detail & Related papers (2022-10-22T21:22:50Z) - End-to-End Learning to Index and Search in Large Output Spaces [95.16066833532396]
Extreme multi-label classification (XMC) is a popular framework for solving real-world problems.
In this paper, we propose a novel method which relaxes the tree-based index to a specialized weighted graph-based index.
ELIAS achieves state-of-the-art performance on several large-scale extreme classification benchmarks with millions of labels.
arXiv Detail & Related papers (2022-10-16T01:34:17Z) - QD-GCN: Query-Driven Graph Convolutional Networks for Attributed Community Search [54.42038098426504]
QD-GCN is an end-to-end framework that unifies the community structure as well as node attributes to solve the ACS problem.
We show that QD-GCN outperforms existing attributed community search algorithms in terms of both efficiency and effectiveness.
arXiv Detail & Related papers (2021-04-08T07:52:48Z) - IRLI: Iterative Re-partitioning for Learning to Index [104.72641345738425]
Methods have to trade off between obtaining high accuracy and maintaining load balance and scalability in distributed settings.
We propose a novel approach called IRLI, which iteratively partitions the items by learning the relevant buckets directly from the query-item relevance data.
We mathematically show that IRLI retrieves the correct item with high probability under very natural assumptions and provides superior load balancing.
arXiv Detail & Related papers (2021-03-17T23:13:25Z) - The Case for Learned Spatial Indexes [62.88514422115702]
We use techniques proposed in a state-of-the-art learned multi-dimensional index structure (namely, Flood) to answer spatial range queries.
We show that (i) machine-learned search within a partition is 11.79% to 39.51% faster than binary search when filtering on one dimension, and (ii) refining using machine-learned indexes is 1.23x to 1.83x faster than the closest competitor, which filters on two dimensions.
arXiv Detail & Related papers (2020-08-24T12:09:55Z)