Micro-architectural Analysis of a Learned Index
- URL: http://arxiv.org/abs/2109.08495v1
- Date: Fri, 17 Sep 2021 12:13:06 GMT
- Title: Micro-architectural Analysis of a Learned Index
- Authors: Mikkel Møller Andersen, Pınar Tözün
- Abstract summary: ALEX is a tree-based in-memory index structure that consists of a hierarchy of machine learned models.
Our results show that ALEX exhibits fewer stalls and a lower cycles-per-instruction value across different workloads.
On the other hand, the number of instructions required to handle out-of-bound inserts in ALEX can increase the instructions needed per request significantly (10X).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since the publication of The Case for Learned Index Structures in 2018, there
has been a rise in research that focuses on learned indexes for different
domains and with different functionalities. While the effectiveness of learned
indexes as an alternative to traditional index structures such as B+Trees has
already been demonstrated by several studies, previous work tends to focus on
higher-level performance metrics such as throughput and index size. In this
paper, our goal is to dig deeper and investigate how learned indexes behave at
a micro-architectural level compared to traditional indexes.
More specifically, we focus on previously proposed learned index structure
ALEX, which is a tree-based in-memory index structure that consists of a
hierarchy of machine learned models. Unlike the original proposal for learned
indexes, ALEX is designed from the ground up to allow updates and inserts.
Therefore, it enables more dynamic workloads using learned indexes. In this
work, we perform a micro-architectural analysis of ALEX and compare its
behavior to the tree-based index structures that are not based on learned
models, i.e., ART and B+Tree.
Our results show that ALEX is bound by memory stalls, mainly stalls due to
data misses from the last-level cache. Compared to ART and B+Tree, ALEX
exhibits fewer stalls and a lower cycles-per-instruction value across different
workloads. On the other hand, the number of instructions required to handle
out-of-bound inserts in ALEX can increase the instructions needed per request
significantly (10X) for write-heavy workloads. However, the micro-architectural
behavior shows that this increase in the instruction footprint exhibits high
instruction-level parallelism and, therefore, does not negatively impact the
overall execution time.
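The lookup pattern the paper analyzes can be sketched minimally: a model predicts a key's position in a sorted array, and a bounded local search corrects the prediction. This is a single-linear-model sketch of the general learned-index idea, not ALEX's actual implementation (ALEX uses a hierarchy of such models with gapped arrays); all names here are illustrative.

```python
import bisect

class LinearModelIndex:
    """Minimal single-model learned index: fit a line mapping key -> position,
    then correct the prediction with an error-bounded local search.
    Illustrative sketch only; not ALEX's actual data structure."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        # Least-squares fit of position as a linear function of the key.
        mean_k = sum(sorted_keys) / n
        mean_p = (n - 1) / 2
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(sorted_keys))
        var = sum((k - mean_k) ** 2 for k in sorted_keys)
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Maximum prediction error bounds the local search window.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(sorted_keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        pos = self._predict(key)
        lo = max(0, pos - self.err)
        hi = min(len(self.keys), pos + self.err + 1)
        # Binary search only inside the error window around the prediction.
        i = lo + bisect.bisect_left(self.keys[lo:hi], key)
        return i if i < len(self.keys) and self.keys[i] == key else None
```

Because the search touches only the small window around the predicted position rather than walking a tree of nodes, this style of lookup can incur fewer last-level-cache data misses than a pointer-chasing B+Tree traversal, which is consistent with the stall behavior reported above.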
Related papers
- Semi-Parametric Retrieval via Binary Token Index [71.78109794895065]
Semi-parametric Vocabulary Disentangled Retrieval (SVDR) is a novel semi-parametric retrieval framework.
It supports two types of indexes: an embedding-based index for high effectiveness, akin to existing neural retrieval methods; and a binary token index that allows for quick and cost-effective setup, resembling traditional term-based retrieval.
It achieves a 3% higher top-1 retrieval accuracy compared to the dense retriever DPR when using an embedding-based index and a 9% higher top-1 accuracy compared to BM25 when using a binary token index.
arXiv Detail & Related papers (2024-05-03T08:34:13Z) - EHI: End-to-end Learning of Hierarchical Index for Efficient Dense Retrieval [18.15717995719973]
End-to-end Hierarchical Indexing (EHI) is a novel method for embedding-based retrieval.
EHI outperforms existing state-of-the-art methods by +1.45% in MRR@10 on MS MARCO (Dev) and +8.2% in nDCG@10 on TREC DL19.
arXiv Detail & Related papers (2023-10-13T06:53:02Z) - End-to-End Learning to Index and Search in Large Output Spaces [95.16066833532396]
Extreme multi-label classification (XMC) is a popular framework for solving real-world problems.
In this paper, we propose a novel method which relaxes the tree-based index to a specialized weighted graph-based index.
ELIAS achieves state-of-the-art performance on several large-scale extreme classification benchmarks with millions of labels.
arXiv Detail & Related papers (2022-10-16T01:34:17Z) - LSI: A Learned Secondary Index Structure [24.324528705706104]
We introduce Learned Secondary Index (LSI), a first attempt to use learned indexes for indexing unsorted data.
We show that LSI achieves comparable lookup performance to state-of-the-art secondary indexes while being up to 6x more space efficient.
arXiv Detail & Related papers (2022-05-11T20:49:44Z) - Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure on the search space: using all n-grams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z) - CARMI: A Cache-Aware Learned Index with a Cost-based Construction Algorithm [1.9798034349981157]
We propose a cache-aware learned index (CARMI) design to improve the efficiency of the Recursive Model Index (RMI) framework.
We formulate the problem of finding the optimal design of a learned index as an optimization problem and propose a dynamic programming algorithm for solving it.
Experiments show that our index construction strategy can construct indexes with significantly better performance compared to baselines.
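The Recursive Model Index (RMI) framework that CARMI improves on can be sketched as a two-stage structure: a root model routes each key to one of several leaf structures, each of which resolves the exact position. The sketch below uses simple linear scaling at the root and a sorted bucket at each leaf; it is illustrative of the routing idea only, and the names and structure are assumptions, not CARMI's actual design.

```python
import bisect

def build_rmi(sorted_keys, num_leaves=4):
    """Build a toy two-stage RMI: a linear root 'model' assigns each key
    to one of num_leaves buckets, preserving global positions."""
    lo, hi = sorted_keys[0], sorted_keys[-1]
    span = (hi - lo) or 1
    buckets = [[] for _ in range(num_leaves)]
    for i, k in enumerate(sorted_keys):
        slot = min(num_leaves - 1, (k - lo) * num_leaves // span)
        buckets[slot].append((k, i))  # keep (key, global position) pairs
    return lo, span, buckets

def rmi_lookup(index, key):
    """Route via the root model, then binary-search the chosen leaf."""
    lo, span, buckets = index
    slot = min(len(buckets) - 1, max(0, (key - lo) * len(buckets) // span))
    leaf = buckets[slot]
    keys = [k for k, _ in leaf]
    j = bisect.bisect_left(keys, key)
    if j < len(keys) and keys[j] == key:
        return leaf[j][1]  # global position of the key
    return None
```

CARMI's contribution, per the summary above, is to choose the node layout of such a structure in a cache-aware way via dynamic programming rather than fixing it by hand as this sketch does.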
arXiv Detail & Related papers (2021-03-01T09:20:53Z) - A Pluggable Learned Index Method via Sampling and Gap Insertion [48.900186573181735]
Database indexes facilitate data retrieval and benefit broad applications in real-world systems.
Recently, a new family of indexes, named learned indexes, has been proposed to learn hidden yet useful data distributions.
We study two general and pluggable techniques to enhance the learning efficiency and learning effectiveness for learned indexes.
arXiv Detail & Related papers (2021-01-04T07:17:23Z) - The Case for Learned Spatial Indexes [62.88514422115702]
We use techniques from a state-of-the-art learned multi-dimensional index structure (namely, Flood) to answer spatial range queries.
We show that (i) machine-learned search within a partition is 11.79% to 39.51% faster than binary search when filtering on one dimension.
We also show that (ii) refining using machine-learned indexes is 1.23x to 1.83x faster than the closest competitor, which filters on two dimensions.
arXiv Detail & Related papers (2020-08-24T12:09:55Z) - COAX: Correlation-Aware Indexing on Multidimensional Data with Soft Functional Dependencies [3.670422696827525]
We present COAX, a learned index for multidimensional data that learns the correlations between attributes of the dataset.
We show experimentally that by predicting correlated attributes in the data, we can improve the query execution time and reduce the memory overhead of the index.
arXiv Detail & Related papers (2020-06-29T21:22:15Z) - Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads [29.223401893397714]
We introduce Tsunami, which achieves up to 6X faster query performance and up to 8X smaller index size than existing learned multi-dimensional indexes.
arXiv Detail & Related papers (2020-06-23T19:25:51Z) - RadixSpline: A Single-Pass Learned Index [84.84747738666263]
We introduce RadixSpline (RS), a learned index that can be built in a single pass over the data.
RS achieves competitive results on all datasets, despite the fact that it only has two parameters.
arXiv Detail & Related papers (2020-04-30T01:56:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.