The Impacts of Data, Ordering, and Intrinsic Dimensionality on Recall in Hierarchical Navigable Small Worlds
- URL: http://arxiv.org/abs/2405.17813v1
- Date: Tue, 28 May 2024 04:16:43 GMT
- Title: The Impacts of Data, Ordering, and Intrinsic Dimensionality on Recall in Hierarchical Navigable Small Worlds
- Authors: Owen Pendrigh Elliott, Jesse Clark
- Abstract summary: Investigation focuses on HNSW's efficacy across a spectrum of datasets.
We discover that the recall of approximate HNSW search, in comparison to exact K Nearest Neighbours (KNN) search, is linked to the vector space's intrinsic dimensionality.
We observe that running popular benchmark datasets with HNSW instead of KNN can shift rankings by up to three positions for some models.
- Abstract: Vector search systems, pivotal in AI applications, often rely on the Hierarchical Navigable Small Worlds (HNSW) algorithm. However, the behaviour of HNSW under real-world scenarios using vectors generated with deep learning models remains under-explored. Existing Approximate Nearest Neighbours (ANN) benchmarks and research typically over-rely on simplistic datasets like MNIST or SIFT1M and fail to reflect the complexity of current use-cases. Our investigation focuses on HNSW's efficacy across a spectrum of datasets, including synthetic vectors tailored to mimic specific intrinsic dimensionalities, widely-used retrieval benchmarks with popular embedding models, and proprietary e-commerce image data with CLIP models. We survey the most popular HNSW vector databases and collate their default parameters to provide a realistic fixed parameterisation for the duration of the paper. We discover that the recall of approximate HNSW search, in comparison to exact K Nearest Neighbours (KNN) search, is linked to the vector space's intrinsic dimensionality and significantly influenced by the data insertion sequence. Our methodology highlights how insertion order, informed by measurable properties such as the pointwise Local Intrinsic Dimensionality (LID) or known categories, can shift recall by up to 12 percentage points. We also observe that running popular benchmark datasets with HNSW instead of KNN can shift rankings by up to three positions for some models. This work underscores the need for more nuanced benchmarks and design considerations in developing robust vector search systems using approximate vector search algorithms. This study presents a number of scenarios with varying real-world applicability which aim to improve understanding and guide future development of ANN algorithms and embeddings.
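The two quantities the abstract rests on, recall of approximate search against exact KNN and the pointwise Local Intrinsic Dimensionality, can be sketched in a few lines of plain Python. This is not the paper's code: the LID estimator shown is the Levina-Bickel maximum-likelihood estimator, a common choice for pointwise LID that the paper may or may not use, and the function names and toy data are illustrative assumptions.

```python
import math

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k neighbours that the approximate
    search (e.g. HNSW) also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

def pointwise_lid(query, points, k=10):
    """Maximum-likelihood estimate of the pointwise Local Intrinsic
    Dimensionality at `query` (Levina-Bickel style), from the k
    nearest neighbour distances within `points`."""
    dists = sorted(math.dist(query, p) for p in points)
    r = [d for d in dists if d > 0][:k]  # drop the query itself if present
    r_k = r[-1]                          # distance to the k-th neighbour
    # LID ~= -( (1/(k-1)) * sum_i log(r_i / r_k) )^{-1}
    return -(k - 1) / sum(math.log(ri / r_k) for ri in r[:-1])
```

On points sampled along a line the estimate comes out near 1, and on a 2-D grid it comes out higher, which is the sense in which LID measures the dimensionality of the data around a point rather than the ambient vector dimension; the paper orders insertions by quantities like this to move HNSW recall.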
Related papers
- OPUS: Occupancy Prediction Using a Sparse Set [64.60854562502523]
We present a framework to simultaneously predict occupied locations and classes using a set of learnable queries.
OPUS incorporates a suite of non-trivial strategies to enhance model performance.
Our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at near 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
arXiv Detail & Related papers (2024-09-14T07:44:22Z)
- CANDY: A Benchmark for Continuous Approximate Nearest Neighbor Search with Dynamic Data Ingestion [8.036012885171166]
We introduce CANDY, a benchmark tailored for Continuous Approximate Nearest Neighbor Search with Dynamic Data Ingestion.
CANDY comprehensively assesses a wide range of AKNN algorithms, integrating advanced optimizations such as machine learning-driven inference.
Our evaluations across diverse datasets demonstrate that simpler AKNN baselines often surpass more complex alternatives in terms of recall and latency.
arXiv Detail & Related papers (2024-06-28T04:46:11Z)
- Approximate Nearest Neighbour Search on Dynamic Datasets: An Investigation [20.409659920455955]
Approximate k-Nearest Neighbour (ANN) methods are often used for mining information and aiding machine learning on large scale high-dimensional datasets.
For applications with static datasets, runtime constraints and dataset properties can be used to empirically select an ANN method with suitable operating characteristics.
Traditional evaluation approaches do not consider the computational costs of updating the index structure, as well as the rate and size of index updates.
arXiv Detail & Related papers (2024-04-30T06:21:44Z)
- Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data, and then a minimal number of available labeled data points are assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- OutRank: Speeding up AutoML-based Model Search for Large Sparse Data sets with Cardinality-aware Feature Ranking [0.0]
We introduce OutRank, a system for versatile feature ranking and data quality-related anomaly detection.
The proposed approach enables exploration of up to 300% larger feature spaces compared to AutoML-only approaches.
arXiv Detail & Related papers (2023-09-04T12:07:20Z)
- Learning Implicit Feature Alignment Function for Semantic Segmentation [51.36809814890326]
Implicit Feature Alignment function (IFA) is inspired by the rapidly expanding topic of implicit neural representations.
We show that IFA implicitly aligns the feature maps at different levels and is capable of producing segmentation maps in arbitrary resolutions.
Our method can be combined with improvement on various architectures, and it achieves state-of-the-art accuracy trade-off on common benchmarks.
arXiv Detail & Related papers (2022-06-17T09:40:14Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that does not force any structure on the search space: using all n-grams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Tightly Coupled Learning Strategy for Weakly Supervised Hierarchical Place Recognition [0.09558392439655011]
We propose a tightly coupled learning (TCL) strategy to train triplet models.
It combines global and local descriptors for joint optimization.
Our lightweight unified model is better than several state-of-the-art methods.
arXiv Detail & Related papers (2022-02-14T03:20:39Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- Deep Retrieval: Learning A Retrievable Structure for Large-Scale Recommendations [21.68175843347951]
We present Deep Retrieval (DR), to learn a retrievable structure directly with user-item interaction data.
DR is among the first non-ANN algorithms successfully deployed at the scale of hundreds of millions of items for industrial recommendation systems.
arXiv Detail & Related papers (2020-07-12T06:23:51Z)
- Heterogeneous Network Representation Learning: A Unified Framework with Survey and Benchmark [57.10850350508929]
We aim to provide a unified framework to summarize and evaluate existing research on heterogeneous network embedding (HNE).
As the first contribution, we provide a generic paradigm for the systematic categorization and analysis over the merits of various existing HNE algorithms.
As the second contribution, we create four benchmark datasets from different sources, with various properties regarding scale, structure, and attribute/label availability.
As the third contribution, we create friendly interfaces for 13 popular HNE algorithms, and provide all-around comparisons among them over multiple tasks and experimental settings.
arXiv Detail & Related papers (2020-04-01T03:42:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.