OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution
Queries
- URL: http://arxiv.org/abs/2211.12850v1
- Date: Sat, 22 Oct 2022 21:22:50 GMT
- Title: OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution
Queries
- Authors: Shikhar Jaiswal, Ravishankar Krishnaswamy, Ankit Garg, Harsha Vardhan
Simhadri, Sheshansh Agrawal
- Abstract summary: State-of-the-art algorithms for Approximate Nearest Neighbor Search (ANNS) build data-dependent indices that offer substantially better accuracy and search efficiency than data-agnostic indices by overfitting to the index data distribution.
We present OOD-DiskANN, which uses a sparing sample (1% of the index set size) of OOD queries and provides up to 40% improvement in mean query latency over SoTA algorithms with a similar memory footprint.
- Score: 8.950168559003993
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art algorithms for Approximate Nearest Neighbor Search
(ANNS), such as DiskANN, FAISS-IVF, and HNSW, build data-dependent indices that
offer substantially better accuracy and search efficiency than data-agnostic
indices by overfitting to the index data distribution. When the query data is
drawn from a different distribution - e.g., when the index holds image
embeddings and the queries are textual embeddings - such algorithms lose much
of this
performance advantage. On a variety of datasets, for a fixed recall target,
latency is worse by an order of magnitude or more for Out-Of-Distribution (OOD)
queries as compared to In-Distribution (ID) queries. The question we address in
this work is whether ANNS algorithms can be made efficient for OOD queries if
the index construction is given access to a small sample set of these queries.
We answer positively by presenting OOD-DiskANN, which uses a sparing sample
(1% of the index set size) of OOD queries and provides up to 40% improvement
in mean query latency over SoTA algorithms with a similar memory footprint.
OOD-DiskANN
is scalable and has the efficiency of graph-based ANNS indices. Some of our
contributions can improve query efficiency for ID queries as well.
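For context on why the latency gap appears, graph indices such as DiskANN and HNSW answer queries with a greedy beam search over a proximity graph, and OOD queries typically need a much larger beam (hence more distance computations) to reach the same recall. Below is a minimal, illustrative sketch of that search loop, assuming an in-memory dict-of-lists graph and numpy vectors; it is not the authors' implementation.

```python
import heapq
import numpy as np

def beam_search(graph, vectors, entry, query, beam_width, k):
    """Greedy best-first search over a proximity graph.

    graph: dict mapping node id -> list of neighbor ids
    vectors: (n, d) numpy array of index vectors
    entry: id of the fixed entry point
    beam_width: candidate-list size L (L >= k); larger L means higher
                recall but higher latency -- the knob that blows up
                for OOD queries
    """
    def dist(i):
        return float(np.linalg.norm(vectors[i] - query))

    visited = {entry}
    frontier = [(dist(entry), entry)]   # min-heap: closest unexpanded first
    beam = [(-dist(entry), entry)]      # max-heap (negated): best L seen so far
    while frontier:
        d, node = heapq.heappop(frontier)
        # stop when the closest unexpanded node is worse than the whole beam
        if len(beam) >= beam_width and d > -beam[0][0]:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(beam) < beam_width or d_nb < -beam[0][0]:
                heapq.heappush(frontier, (d_nb, nb))
                heapq.heappush(beam, (-d_nb, nb))
                if len(beam) > beam_width:
                    heapq.heappop(beam)  # drop the current worst
    return sorted((-nd, i) for nd, i in beam)[:k]
```

Under this reading, OOD-DiskANN's contribution is to build the graph so that a small beam already reaches high recall for queries drawn from the sampled OOD distribution.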
Related papers
- Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes? [62.57689536630933]
We provide experimental results on the BEIR dataset using the open-source Lucene search library.
Our results provide guidance for today's search practitioner in understanding the design space of dense and sparse retrievers.
arXiv Detail & Related papers (2024-09-10T12:46:23Z)
- RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search [11.069814476661827]
Cross-modal ANNS aims to use the data vector from one modality to retrieve the most similar items from another.
State-of-the-art ANNS approaches suffer poor performance for OOD workloads.
We propose pRojected bipartite Graph (RoarGraph), an efficient ANNS graph index built under the guidance of query distribution.
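The summary suggests the core construction: link sample queries to their nearest base vectors in a bipartite graph, then project it onto the base side so that base points sharing a query neighbor become connected. A rough sketch of that projection step under our reading (brute-force top-k, illustrative names; not the RoarGraph code):

```python
from itertools import combinations
import numpy as np

def project_bipartite(base, sample_queries, k=8, max_degree=32):
    """base: (n, d) array of index vectors; sample_queries: (m, d) array."""
    n = base.shape[0]
    neighbors = [set() for _ in range(n)]
    for q in sample_queries:
        # exact top-k base neighbors of this sample query (brute force here)
        topk = np.argsort(np.linalg.norm(base - q, axis=1))[:k]
        # project: base points that share a query "witness" get linked
        for u, v in combinations(topk, 2):
            if len(neighbors[u]) < max_degree:
                neighbors[u].add(int(v))
            if len(neighbors[v]) < max_degree:
                neighbors[v].add(int(u))
    # adjacency lists usable by a greedy search like the sketch above
    return {i: sorted(neighbors[i]) for i in range(n)}
```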
arXiv Detail & Related papers (2024-08-16T06:48:16Z)
- Semi-Parametric Retrieval via Binary Token Index [71.78109794895065]
Semi-parametric Vocabulary Disentangled Retrieval (SVDR) is a novel semi-parametric retrieval framework.
It supports two types of indexes: an embedding-based index for high effectiveness, akin to existing neural retrieval methods; and a binary token index that allows for quick and cost-effective setup, resembling traditional term-based retrieval.
It achieves a 3% higher top-1 retrieval accuracy compared to the dense retriever DPR when using an embedding-based index and a 9% higher top-1 accuracy compared to BM25 when using a binary token index.
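As a toy illustration of what a binary token index might look like (our reading of the summary, not SVDR's implementation), each document can be encoded as a bit set over the vocabulary and ranked by token overlap with the query, much like term-based retrieval:

```python
def build_binary_index(docs, vocab):
    """docs: list of strings; vocab: list of tokens to index."""
    bit_of = {tok: i for i, tok in enumerate(vocab)}
    index = []
    for doc in docs:
        bits = 0
        for tok in doc.split():
            if tok in bit_of:
                bits |= 1 << bit_of[tok]  # set the token's bit
        index.append(bits)
    return index, bit_of

def search(index, bit_of, query, topn=5):
    q = 0
    for tok in query.split():
        if tok in bit_of:
            q |= 1 << bit_of[tok]
    # rank documents by the number of query tokens they contain
    scored = [(bin(q & bits).count("1"), i) for i, bits in enumerate(index)]
    return sorted(scored, reverse=True)[:topn]
```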
arXiv Detail & Related papers (2024-05-03T08:34:13Z)
- Constructing Tree-based Index for Efficient and Effective Dense Retrieval [26.706985694158384]
JTR stands for Joint optimization of TRee-based index and query encoding.
We design a new unified contrastive learning loss to train the tree-based index and the query encoder in an end-to-end manner.
Experimental results show that JTR achieves better retrieval performance while retaining high system efficiency.
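The unified loss itself is not specified here, but a generic InfoNCE-style contrastive objective of the kind such end-to-end systems use would score a query encoding against candidate tree nodes and pull it toward the node leading to the relevant document. A hedged sketch (names and temperature are illustrative, not JTR's actual loss):

```python
import numpy as np

def contrastive_loss(query_emb, node_embs, positive_idx, temperature=0.05):
    """query_emb: (d,); node_embs: (n, d) embeddings of candidate tree nodes;
    positive_idx: the node on the path to the relevant document."""
    sims = node_embs @ query_emb / temperature
    sims = sims - sims.max()                    # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()   # softmax over candidates
    return -np.log(probs[positive_idx])         # cross-entropy on the positive
```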
arXiv Detail & Related papers (2023-04-24T09:25:39Z)
- WISK: A Workload-aware Learned Index for Spatial Keyword Queries [46.96314606580924]
We propose WISK, a learned index for spatial keyword queries.
We show that WISK achieves up to 8x speedup in querying time with comparable storage overhead.
arXiv Detail & Related papers (2023-02-28T03:45:25Z)
- A Learned Index for Exact Similarity Search in Metric Spaces [25.330353637669386]
LIMS uses data clustering and pivot-based data transformation techniques to build learned indexes.
Machine learning models are developed to approximate the position of each data record on the disk.
Extensive experiments on real-world and synthetic datasets demonstrate the superiority of LIMS compared with traditional indexes.
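A minimal sketch of the idea the summary describes, under our assumptions: map records into a one-dimensional pivot space, sort, fit a simple model that approximates each record's position, and correct the model's error with a bounded local search. The linear model and names are illustrative, not LIMS's implementation:

```python
import numpy as np

def build_index(records, pivot):
    """records: (n, d) array; pivot: (d,) reference point for the transform."""
    keys = np.linalg.norm(records - pivot, axis=1)  # pivot-based 1-D transform
    order = np.argsort(keys)
    keys_sorted = keys[order]
    # a linear model approximating each record's (disk) position
    slope, intercept = np.polyfit(keys_sorted, np.arange(len(keys_sorted)), 1)
    return keys_sorted, order, slope, intercept

def lookup(key, keys_sorted, slope, intercept, max_err=64):
    """Assumes the model's position error stays below max_err."""
    guess = int(slope * key + intercept)
    lo = max(0, guess - max_err)
    hi = min(len(keys_sorted), guess + max_err)
    # a bounded local search corrects the approximation error
    return lo + int(np.searchsorted(keys_sorted[lo:hi], key))
```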
arXiv Detail & Related papers (2022-04-21T11:24:55Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
The proposed algorithm is more accurate and efficient than existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- A Pluggable Learned Index Method via Sampling and Gap Insertion [48.900186573181735]
Database indexes facilitate data retrieval and benefit broad applications in real-world systems.
Recently, a new family of indexes, called learned indexes, has been proposed to learn hidden yet useful structure in the data distribution.
We study two general and pluggable techniques to enhance the learning efficiency and learning effectiveness of learned indexes.
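Gap insertion, as we read it, places keys in an array larger than the key count at the slots a fitted model predicts, so the leftover gaps absorb model error and future inserts. A hypothetical sketch (the linear model and probing policy are our assumptions, not the paper's method):

```python
import numpy as np

def build_gapped_array(keys, expansion=1.5):
    """Place sorted keys into an array ~expansion times larger than needed."""
    keys = np.sort(np.asarray(keys, dtype=float))
    n, slots = len(keys), int(len(keys) * expansion)
    # model maps a key to a slot in the expanded (gapped) address space
    slope, intercept = np.polyfit(keys, np.arange(n) * expansion, 1)
    arr = [None] * slots
    for key in keys:
        i = min(slots - 1, max(0, int(slope * key + intercept)))
        while arr[i] is not None:   # linear probe past collisions;
            i = (i + 1) % slots     # terminates because slots > n
        arr[i] = key
    return arr, slope, intercept
```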
arXiv Detail & Related papers (2021-01-04T07:17:23Z)
- The Case for Learned Spatial Indexes [62.88514422115702]
We apply techniques from a state-of-the-art learned multi-dimensional index structure (namely, Flood) to answer spatial range queries.
We show that (i) machine-learned search within a partition is 11.79% to 39.51% faster than binary search when filtering on one dimension.
We also show that (ii) refinement using machine-learned indexes is 1.23x to 1.83x faster than the closest competitor, which filters on two dimensions.
arXiv Detail & Related papers (2020-08-24T12:09:55Z)
- COAX: Correlation-Aware Indexing on Multidimensional Data with Soft Functional Dependencies [3.670422696827525]
We present COAX, a learned index for multidimensional data that learns the correlations between attributes of the dataset.
We show experimentally that by predicting correlated attributes in the data, we can improve the query execution time and reduce the memory overhead of the index.
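As a sketch of the soft-functional-dependency idea (our reading, with hypothetical names): fit a model B ≈ f(A) with a residual bound, index only attribute A, and answer predicates on B by translating them into a slightly widened range over A:

```python
import numpy as np

def fit_dependency(a_vals, b_vals):
    """Learn B ~ slope*A + intercept with a worst-case residual bound."""
    slope, intercept = np.polyfit(a_vals, b_vals, 1)
    eps = np.abs(b_vals - (slope * a_vals + intercept)).max()
    return slope, intercept, eps

def b_range_to_a_range(b_lo, b_hi, slope, intercept, eps):
    """Translate a predicate on B into a widened range over indexed A.
    Assumes a non-trivial (nonzero-slope) dependency."""
    lo = (b_lo - eps - intercept) / slope
    hi = (b_hi + eps - intercept) / slope
    return (min(lo, hi), max(lo, hi))
```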
arXiv Detail & Related papers (2020-06-29T21:22:15Z)