Related papers: Benchmarking Filtered Approximate Nearest Neighbor Search Algorithms on Transformer-based Embedding Vectors

Benchmarking Filtered Approximate Nearest Neighbor Search Algorithms on Transformer-based Embedding Vectors

URL: http://arxiv.org/abs/2507.21989v1
Date: Tue, 29 Jul 2025 16:39:54 GMT
Title: Benchmarking Filtered Approximate Nearest Neighbor Search Algorithms on Transformer-based Embedding Vectors
Authors: Patrick Iff, Paul Bruegger, Marcin Chrapek, Maciej Besta, Torsten Hoefler,
Abstract summary: Filtered Approximate Nearest Neighbor Search is a problem known as Filtered Approximate Nearest Neighbor Search (FANNS)<n>We present a comprehensive survey and taxonomy of FANNS methods and analyze how they are benchmarked in the literature.<n>We introduce a novel dataset consisting of embedding vectors for the abstracts of over 2.7 million research articles from the arXiv repository, accompanied by 11 real-world attributes such as authors and categories.
Score: 18.796661826646616
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Advances in embedding models for text, image, audio, and video drive progress across multiple domains, including retrieval-augmented generation, recommendation systems, vehicle/person reidentification, and face recognition. Many applications in these domains require an efficient method to retrieve items that are close to a given query in the embedding space while satisfying a filter condition based on the item's attributes, a problem known as Filtered Approximate Nearest Neighbor Search (FANNS). In this work, we present a comprehensive survey and taxonomy of FANNS methods and analyze how they are benchmarked in the literature. By doing so, we identify a key challenge in the current FANNS landscape: the lack of diverse and realistic datasets, particularly ones derived from the latest transformer-based text embedding models. To address this, we introduce a novel dataset consisting of embedding vectors for the abstracts of over 2.7 million research articles from the arXiv repository, accompanied by 11 real-world attributes such as authors and categories. We benchmark a wide range of FANNS methods on our novel dataset and find that each method has distinct strengths and limitations; no single approach performs best across all scenarios. ACORN, for example, supports various filter types and performs reliably across dataset scales but is often outperformed by more specialized methods. SeRF shows excellent performance for range filtering on ordered attributes but cannot handle categorical attributes. Filtered-DiskANN and UNG excel on the medium-scale dataset but fail on the large-scale dataset, highlighting the challenge posed by transformer-based embeddings, which are often more than an order of magnitude larger than earlier embeddings. We conclude that no universally best method exists.

Related papers

Filtered Approximate Nearest Neighbor Search in Vector Databases: System Design and Performance Analysis [0.5249805590164902]
Filtered Approximate Nearest Neighbor Search (FANNS) is used to combine semantic retrieval with metadata constraints.<n>In this work, we systematize the taxonomy of filtering strategies and evaluate their integration into vectors.<n>Our experiments reveal that algorithmic adaptations within the engine often override raw index performance.
arXiv Detail & Related papers (2026-02-11T23:40:26Z)
JAG: Joint Attribute Graphs for Filtered Nearest Neighbor Search [8.398505388767248]
JAG (Joint Attribute Graphs) is a graph-based algorithm designed to deliver robust performance across the entire selectivity spectrum.<n>Our experimental results demonstrate that JAG significantly outperforms existing state-of-the-art baselines in both throughput and recall robustness.
arXiv Detail & Related papers (2026-02-10T20:00:51Z)
Learning Filter-Aware Distance Metrics for Nearest Neighbor Search with Multiple Filters [7.493563348124023]
Filtered Approximate Nearest Neighbor (ANN) search retrieves the closest vectors for a query vector from a dataset.<n>Existing graph-based methods typically incorporate filter awareness by assigning fixed penalties.<n>We propose a principled alternative that learns the optimal trade-off between vector distance and filter match directly from the data.
arXiv Detail & Related papers (2025-11-06T05:24:41Z)
Attribute Filtering in Approximate Nearest Neighbor Search: An In-depth Experimental Study [18.5007917065799]
We present a unified Filtering ANN search interface that encompasses the latest algorithms and evaluate them extensively from multiple perspectives.<n>First, we propose a comprehensive taxonomy of existing Filtering ANN algorithms based on attribute types and filtering strategies.<n>We then conduct a broad experimental evaluation on 10 algorithms and 12 methods across 4 datasets.
arXiv Detail & Related papers (2025-08-22T09:54:57Z)
Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval [22.33550491040999]
RAG grounds large language models in external evidence, yet it still falters when answers must be pieced together across semantically distant documents.<n>We build two plug-and-play retrievers: StatementGraphRAG and TopicGraphRAG.<n>Our methods outperform naive chunk-based RAG achieving an average relative improvement of 23.1% in retrieval recall and correctness.
arXiv Detail & Related papers (2025-06-09T17:58:35Z)
MUSS: Multilevel Subset Selection for Relevance and Diversity [4.8254343133177295]
In recommender systems, one is interested in selecting relevant items, while providing a diversified recommendation.<n>We present a novel theoretical approach for analyzing this type of problems, and show that our method achieves a constant factor approximation of the optimal objective.
arXiv Detail & Related papers (2025-03-14T06:37:17Z)
RQFormer: Rotated Query Transformer for End-to-End Oriented Object Detection [26.37802649901314]
Oriented object detection presents a challenging task due to the presence of object instances with multiple orientations, varying scales, and dense distributions.<n>We propose an end-to-end oriented detector called the Rotated Query Transformer, which integrates two key technologies.<n>Experiments conducted on four remote sensing datasets and one scene text dataset demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-11-29T13:43:17Z)
Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection [2.3814052021083354]
This work presents an adaptive group testing framework for the range-based high dimensional near neighbor search problem. Our method efficiently marks each item in a database as neighbor or non-neighbor of a query point, based on a cosine distance threshold without exhaustive search. We show that, using softmax-based features, our method achieves a more than ten-fold speed-up over exhaustive search with no loss of accuracy.
arXiv Detail & Related papers (2023-11-05T06:12:03Z)
infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization. infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information. In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
Abstractive Summarization as Augmentation for Document-Level Event Detection [0.0]
We bridge the performance gap between shallow and deep models on document-level event detection by using abstractive text summarization as an augmentation method. We use four decoding methods for text generation, namely beam search, top-k sampling, top-p sampling, and contrastive search. Our results show that using the document title offers 2.04% and 3.19% absolute improvement in macro F1-score for linear SVM and RoBERTa, respectively.
arXiv Detail & Related papers (2023-05-29T11:28:26Z)
Finding Support Examples for In-Context Learning [73.90376920653507]
We propose LENS, a fiLter-thEN-Search method to tackle this challenge in two stages. First we filter the dataset to obtain informative in-context examples individually. Then we propose diversity-guided example search which iteratively refines and evaluates the selected example permutations.
arXiv Detail & Related papers (2023-02-27T06:32:45Z)
MD-CSDNetwork: Multi-Domain Cross Stitched Network for Deepfake Detection [80.83725644958633]
Current deepfake generation methods leave discriminative artifacts in the frequency spectrum of fake images and videos. We present a novel approach, termed as MD-CSDNetwork, for combining the features in the spatial and frequency domains to mine a shared discriminative representation.
arXiv Detail & Related papers (2021-09-15T14:11:53Z)
Tradeoffs in Sentence Selection Techniques for Open-Domain Question Answering [54.541952928070344]
We describe two groups of models for sentence selection: QA-based approaches, which run a full-fledged QA system to identify answer candidates, and retrieval-based models, which find parts of each passage specifically related to each question. We show that very lightweight QA models can do well at this task, but retrieval-based models are faster still.
arXiv Detail & Related papers (2020-09-18T23:39:15Z)
A Hybrid Swarm and Gravitation based feature selection algorithm for Handwritten Indic Script Classification problem [39.11055166524374]
We introduce a new FS algorithm, called Hybrid Swarm and Gravitation based FS (HSGFS) This algorithm is made to run on 3 feature vectors introduced in the literature recently - Distance-Hough Transform (DHT), Histogram of Oriented Gradients (HOG) and Modified log-Gabor (MLG) filter Transform. Handwritten datasets, prepared at block, text-line and word level, consisting of officially recognized 12 Indic scripts are used for the evaluation of our method.
arXiv Detail & Related papers (2020-05-10T07:27:55Z)
Selecting Relevant Features from a Multi-domain Representation for Few-shot Classification [91.67977602992657]
We propose a new strategy based on feature selection, which is both simpler and more effective than previous feature adaptation approaches. We show that a simple non-parametric classifier built on top of such features produces high accuracy and generalizes to domains never seen during training.
arXiv Detail & Related papers (2020-03-20T15:44:17Z)
Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach. IFAQ treats the feature extraction query and the learning task as one program given in the IFAQ's domain-specific language. We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.