A Pluggable Learned Index Method via Sampling and Gap Insertion
- URL: http://arxiv.org/abs/2101.00808v1
- Date: Mon, 4 Jan 2021 07:17:23 GMT
- Title: A Pluggable Learned Index Method via Sampling and Gap Insertion
- Authors: Yaliang Li, Daoyuan Chen, Bolin Ding, Kai Zeng, Jingren Zhou
- Abstract summary: Database indexes facilitate data retrieval and benefit broad applications in real-world systems.
Recently, a new family of indexes, called learned indexes, has been proposed to learn hidden yet useful data distributions.
We study two general and pluggable techniques to enhance the learning efficiency and learning effectiveness of learned indexes.
- Score: 48.900186573181735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Database indexes facilitate data retrieval and benefit broad applications in real-world systems. Recently, a new family of indexes, called learned indexes, has been proposed to learn hidden yet useful data distributions and incorporate this information into index construction, which leads to promising performance improvements. However, the "learning" process of learned indexes is still under-explored. In this paper, we propose a formal machine-learning-based framework to quantify the index learning objective, and we study two general and pluggable techniques to enhance the learning efficiency and learning effectiveness of learned indexes. Guided by the formal learning objective, we can learn indexes efficiently by incorporating the proposed sampling technique, and learn precise indexes with the enhanced generalization ability brought by the proposed result-driven gap insertion technique.
We conduct extensive experiments on real-world datasets and compare several indexing methods from the perspective of the index learning objective. The results show the ability of the proposed framework to help design suitable indexes for different scenarios. Further, we demonstrate the effectiveness of the proposed sampling technique, which achieves up to 78x construction speedup while maintaining non-degraded indexing performance. Finally, we show that the gap insertion technique can enhance both the static and dynamic indexing performance of existing learned index methods, with up to 1.59x query speedup.
We will release our code and processed data for further study, which can enable more exploration of learned indexes from the perspectives of both machine learning and databases.
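As a rough illustration of the two ideas, here is a minimal sketch of our own (not the paper's algorithm: a simple linear model stands in for the learned index, and the gap-insertion step is heavily simplified). The model is fit on a sample of the sorted (key, position) pairs to cut construction cost, and a worst-case error bound keeps lookups correct; gaps are then inserted so stored positions track the model's predictions.

```python
import bisect

def fit_linear(xs, ys):
    """Ordinary least squares for a one-variable model y ~ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = cov / var if var else 0.0
    return a, my - a * mx

class SampledLinearIndex:
    """Learned index sketch: fit key -> position on a SAMPLE of the
    sorted keys, then correct predictions with a bounded binary search."""

    def __init__(self, sorted_keys, step=100):
        self.keys = sorted_keys
        xs = sorted_keys[::step]                    # sampled training set
        ys = list(range(0, len(sorted_keys), step))
        self.a, self.b = fit_linear(xs, ys)
        # Worst-case prediction error bounds the search window.
        self.err = max(abs(self.predict(k) - i)
                       for i, k in enumerate(sorted_keys))

    def predict(self, key):
        p = int(self.a * key + self.b)
        return min(max(p, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Predict a position, then binary-search the error window."""
        p = self.predict(key)
        lo, hi = max(0, p - self.err), min(len(self.keys), p + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i < len(self.keys) and self.keys[i] == key

def insert_gaps(index, size_factor=1.5):
    """Gap insertion sketch (our simplification of the result-driven
    idea): spread keys into a larger array at model-suggested slots so
    stored positions track predictions, leaving gaps for future inserts."""
    table = [None] * (int(len(index.keys) * size_factor) + 1)
    slot = -1
    for key in index.keys:
        want = int(index.predict(key) * size_factor)
        slot = max(slot + 1, want)      # strictly increasing keeps order
        if slot >= len(table):
            table.extend([None] * (slot - len(table) + 1))
        table[slot] = key
    return table
```

Sampling shrinks the training set by a factor of `step`, while the stored error bound keeps lookups exact even though the model never saw most keys.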
Related papers
- Knowledge Navigator: LLM-guided Browsing Framework for Exploratory Search in Scientific Literature [48.572336666741194]
We present Knowledge Navigator, a system designed to enhance exploratory search abilities.
It organizes retrieved documents into a navigable, two-level hierarchy of named and descriptive scientific topics and subtopics.
arXiv Detail & Related papers (2024-08-28T14:48:37Z)
- Semi-Parametric Retrieval via Binary Token Index [71.78109794895065]
Semi-parametric Vocabulary Disentangled Retrieval (SVDR) is a novel semi-parametric retrieval framework.
It supports two types of indexes: an embedding-based index for high effectiveness, akin to existing neural retrieval methods; and a binary token index that allows for quick and cost-effective setup, resembling traditional term-based retrieval.
It achieves a 3% higher top-1 retrieval accuracy compared to the dense retriever DPR when using an embedding-based index and a 9% higher top-1 accuracy compared to BM25 when using a binary token index.
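The binary-token-index side can be sketched as a plain presence/absence inverted index (our simplification; SVDR's actual construction and scoring differ). Each document reduces to the set of tokens it contains, so building the index needs no embedding model:

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_binary_index(docs):
    """Binary token index sketch: map each token to the set of doc ids
    that contain it (presence/absence only, no weights)."""
    index = {}
    for doc_id, text in enumerate(docs):
        for tok in set(tokenize(text)):
            index.setdefault(tok, set()).add(doc_id)
    return index

def search(index, query, k=1):
    """Rank documents by how many distinct query tokens they contain."""
    scores = Counter()
    for tok in set(tokenize(query)):
        for doc_id in index.get(tok, ()):
            scores[doc_id] += 1
    return [doc_id for doc_id, _ in scores.most_common(k)]
```

Setup cost is a single pass over the corpus, which is what makes this style of index quick and cheap compared with an embedding-based one.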
arXiv Detail & Related papers (2024-05-03T08:34:13Z)
- A Large Scale Search Dataset for Unbiased Learning to Rank [51.97967284268577]
We introduce the Baidu-ULTR dataset for unbiased learning to rank.
It comprises 1.2 billion randomly sampled search sessions and 7,008 expert-annotated queries.
It provides: (1) the original semantic features and a pre-trained language model for easy use; (2) sufficient display information, such as position, displayed height, and displayed abstract; and (3) rich user feedback on search result pages (SERPs), such as dwell time.
arXiv Detail & Related papers (2022-07-07T02:37:25Z)
- LSI: A Learned Secondary Index Structure [24.324528705706104]
We introduce Learned Secondary Index (LSI), a first attempt to use learned indexes for indexing unsorted data.
We show that LSI achieves comparable lookup performance to state-of-the-art secondary indexes while being up to 6x more space efficient.
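The core idea can be sketched as a learned model over the sorted order plus a permutation vector mapping sorted ranks back to original positions (our simplification; LSI additionally uses fingerprints and compact permutation encodings):

```python
import bisect

def build_lsi(values):
    """Sketch of a learned secondary index over UNSORTED data: sort
    (value, original position) pairs, keep the permutation, and fit a
    simple linear model value -> sorted rank with its worst-case error."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # permutation
    sorted_vals = [values[i] for i in order]
    n = len(sorted_vals)
    mx, my = sum(sorted_vals) / n, (n - 1) / 2
    var = sum((v - mx) ** 2 for v in sorted_vals) or 1.0
    cov = sum((v - mx) * (r - my) for r, v in enumerate(sorted_vals))
    a, b = cov / var, my - (cov / var) * mx
    err = max(abs(int(a * v + b) - r) for r, v in enumerate(sorted_vals))
    return sorted_vals, order, (a, b, err)

def lsi_lookup(values, index, needle):
    """Return the original position of needle, or -1 if absent."""
    sorted_vals, order, (a, b, err) = index
    p = min(max(int(a * needle + b), 0), len(sorted_vals) - 1)
    lo, hi = max(0, p - err), min(len(sorted_vals), p + err + 1)
    r = bisect.bisect_left(sorted_vals, needle, lo, hi)
    if r < len(sorted_vals) and sorted_vals[r] == needle:
        return order[r]          # permutation maps rank -> original slot
    return -1
```

The space saving in the real system comes from compressing the permutation and model; this sketch only shows the lookup path.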
arXiv Detail & Related papers (2022-05-11T20:49:44Z)
- A Learned Index for Exact Similarity Search in Metric Spaces [25.330353637669386]
LIMS uses data clustering and pivot-based data transformation techniques to build learned indexes.
Machine learning models are developed to approximate the position of each data record on the disk.
Extensive experiments on real-world and synthetic datasets demonstrate the superiority of LIMS compared with traditional indexes.
arXiv Detail & Related papers (2022-04-21T11:24:55Z)
- Learned Indexes for a Google-scale Disk-based Database [23.93643265060042]
We show how a learned index can be integrated into a distributed, disk-based database system: Google's Bigtable.
Our results show that integrating a learned index significantly improves end-to-end read latency and throughput for Bigtable.
arXiv Detail & Related papers (2020-12-23T05:56:45Z)
- The Case for Learned Spatial Indexes [62.88514422115702]
We use techniques proposed from a state-of-the art learned multi-dimensional index structure (namely, Flood) to answer spatial range queries.
We show that (i) machine-learned search within a partition is 11.79% to 39.51% faster than binary search when filtering on one dimension,
and (ii) refinement using machine-learned indexes is 1.23x to 1.83x faster than the closest competitor that filters on two dimensions.
arXiv Detail & Related papers (2020-08-24T12:09:55Z)
- Learning from Data to Speed-up Sorted Table Search Procedures: Methodology and Practical Guidelines [0.0]
We study to what extent machine learning techniques can contribute to obtaining such a speed-up.
We characterize the scenarios in which the latter can be profitably used with respect to the former, accounting for both CPU and GPU computing.
Indeed, we formalize an algorithmic paradigm of learned dichotomic Sorted Table Search procedures that naturally complements the learned one proposed here, and that characterizes most of the known Sorted Table Search procedures as having a "learning phase" that approximates simple linear regression.
arXiv Detail & Related papers (2020-07-20T16:26:54Z)
- Post-Estimation Smoothing: A Simple Baseline for Learning with Side Information [102.18616819054368]
We propose a post-estimation smoothing operator as a fast and effective method for incorporating structural index data into prediction.
Because the smoothing step is separate from the original predictor, it applies to a broad class of machine learning tasks.
Our experiments on large scale spatial and temporal datasets highlight the speed and accuracy of post-estimation smoothing in practice.
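The operator itself is simple. Below is a sketch of our own with an assumed moving-average kernel over a one-dimensional index such as time; the paper treats general spatial and temporal side information and other kernels:

```python
def post_estimation_smooth(preds, window=3):
    """Replace each prediction with the mean of predictions whose index
    lies within `window` positions. Smoothing happens AFTER the base
    model has run, so it works with any predictor."""
    n = len(preds)
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        out.append(sum(preds[lo:hi]) / (hi - lo))
    return out
```

Because the smoothing step only touches the base model's outputs, it adds one linear pass over the predictions regardless of how expensive the original predictor was.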
arXiv Detail & Related papers (2020-03-12T18:04:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site makes no guarantees about the quality of the information it contains and is not responsible for any consequences of its use.