The Curious Case of High-Dimensional Indexing as a File Structure: A Case Study of eCP-FS
- URL: http://arxiv.org/abs/2507.21939v1
- Date: Tue, 29 Jul 2025 15:51:44 GMT
- Title: The Curious Case of High-Dimensional Indexing as a File Structure: A Case Study of eCP-FS
- Authors: Omar Shahbaz Khan, Gylfi Þór Guðmundsson, Björn Þór Jónsson,
- Abstract summary: eCP-FS is a file-based implementation of eCP, a well-known disk-based ANN index.<n>This paper presents eCP-FS, a file-based implementation of eCP, a well-known disk-based ANN index.<n>In a memory-constrained scenario, eCP-FS offers a minimal memory footprint, making it ideal for resource-constrained or multi-index environments.
- Score: 0.8998543739618077
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern analytical pipelines routinely deploy multiple deep learning and retrieval models that rely on approximate nearest-neighbor (ANN) indexes to support efficient similarity-based search. While many state-of-the-art ANN-indexes are memory-based (e.g., HNSW and IVF), using multiple ANN indexes creates a competition for limited GPU/CPU memory resources, which in turn necessitates disk-based index structures (e.g., DiskANN or eCP). In typical index implementations, the main component is a complex data structure that is serialized to disk and is read either fully at startup time, for memory-based indexes, or incrementally at query time, for disk-based indexes. To visualize the index structure, or analyze its quality, complex coding is needed that is either embedded in the index implementation or replicates the code that reads the data structure. In this paper, we consider an alternative approach that maps the data structure to a file structure, using a file library, making the index easily readable for any programming language and even human-readable. The disadvantage is that the serialized index is verbose, leading to overhead of searching through the index. The question addressed in this paper is how severe this performance penalty is. To that end, this paper presents eCP-FS, a file-based implementation of eCP, a well-known disk-based ANN index. A comparison with state-of-the-art indexes shows that while eCP-FS is slower, the implementation is nevertheless somewhat competitive even when memory is not constrained. In a memory-constrained scenario, eCP-FS offers a minimal memory footprint, making it ideal for resource-constrained or multi-index environments.
Related papers
- Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index [124.68209298883296]
We present Infini-gram mini, a scalable system that can make petabyte-level text corpora searchable.<n>We index 46TB of Internet text in 50 days with a single 128-core CPU node.<n>We show one important use case of Infini-gram mini in a large-scale analysis of benchmark contamination.
arXiv Detail & Related papers (2025-06-13T21:13:57Z) - LEANN: A Low-Storage Vector Index [70.13770593890655]
LEANN is a storage-efficient approximate nearest neighbor search index optimized for resource-constrained personal devices.<n>Our evaluation shows that LEANN reduces index size to under 5% of the original raw data, achieving up to 50 times smaller storage than standard indexes.
arXiv Detail & Related papers (2025-06-09T22:43:30Z) - HAKES: Scalable Vector Database for Embedding Search Service [16.034584281180006]
We build a vector database that achieves high throughput and high recall under concurrent read-write workloads.<n>Our index outperforms index baselines in the high recall region and under concurrent read-write workloads.<n>namesys is scalable and achieves up to $16times$ higher throughputs than the baselines.
arXiv Detail & Related papers (2025-05-18T19:26:29Z) - Semi-Parametric Retrieval via Binary Bag-of-Tokens Index [71.78109794895065]
SemI-parametric Disentangled Retrieval (SiDR) is a bi-encoder retrieval framework that decouples retrieval index from neural parameters.<n>SiDR supports a non-parametric tokenization index for search, achieving BM25-like indexing complexity with significantly better effectiveness.
arXiv Detail & Related papers (2024-05-03T08:34:13Z) - WISK: A Workload-aware Learned Index for Spatial Keyword Queries [46.96314606580924]
We propose WISK, a learned index for spatial keyword queries.
We show that WISK achieves up to 8x speedup in querying time with comparable storage overhead.
arXiv Detail & Related papers (2023-02-28T03:45:25Z) - End-to-End Learning to Index and Search in Large Output Spaces [95.16066833532396]
Extreme multi-label classification (XMC) is a popular framework for solving real-world problems.
In this paper, we propose a novel method which relaxes the tree-based index to a specialized weighted graph-based index.
ELIAS achieves state-of-the-art performance on several large-scale extreme classification benchmarks with millions of labels.
arXiv Detail & Related papers (2022-10-16T01:34:17Z) - Bridging the Gap Between Indexing and Retrieval for Differentiable
Search Index with Query Generation [98.02743096197402]
Differentiable Search Index (DSI) is an emerging paradigm for information retrieval.
We propose a simple yet effective indexing framework for DSI, called DSI-QG.
Empirical results on popular mono-lingual and cross-lingual passage retrieval datasets show that DSI-QG significantly outperforms the original DSI model.
arXiv Detail & Related papers (2022-06-21T06:21:23Z) - LSI: A Learned Secondary Index Structure [24.324528705706104]
We introduce Learned Secondary Index (LSI), a first attempt to use learned indexes for indexing unsorted data.
We show that LSI achieves comparable lookup performance to state-of-the-art secondary indexes while being up to 6x more space efficient.
arXiv Detail & Related papers (2022-05-11T20:49:44Z) - A Learned Index for Exact Similarity Search in Metric Spaces [25.330353637669386]
LIMS is proposed to use data clustering and pivot-based data transformation techniques to build learned indexes.
Machine learning models are developed to approximate the position of each data record on the disk.
Extensive experiments on real-world and synthetic datasets demonstrate the superiority of LIMS compared with traditional indexes.
arXiv Detail & Related papers (2022-04-21T11:24:55Z) - A Semantic Indexing Structure for Image Retrieval [9.889773269004241]
We propose a new classification-based indexing structure, called Semantic Indexing Structure (SIS)
SIS uses semantic categories rather than clustering centers to create database partitions.
SIS achieves outstanding performance compared with state-of-the-art models.
arXiv Detail & Related papers (2021-09-14T11:12:30Z) - Partial 3D Object Retrieval using Local Binary QUICCI Descriptors and
Dissimilarity Tree Indexing [2.922007656878633]
A complete pipeline is presented for accurate and efficient partial 3D object retrieval based on Quick Intersection Count Change Image (QUICCI)
It is shown how a modification to the QUICCI query descriptor makes it ideal for partial retrieval.
An indexing structure called Dissimilarity Tree is proposed which can significantly accelerate searching the large space of local descriptors.
arXiv Detail & Related papers (2021-07-07T17:30:47Z) - The Case for Learned Spatial Indexes [62.88514422115702]
We use techniques proposed from a state-of-the art learned multi-dimensional index structure (namely, Flood) to answer spatial range queries.
We show that (i) machine learned search within a partition is faster by 11.79% to 39.51% than binary search when using filtering on one dimension.
We also refine using machine learned indexes is 1.23x to 1.83x times faster than closest competitor which filters on two dimensions.
arXiv Detail & Related papers (2020-08-24T12:09:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.