COAX: Correlation-Aware Indexing on Multidimensional Data with Soft
Functional Dependencies
- URL: http://arxiv.org/abs/2006.16393v3
- Date: Tue, 2 Feb 2021 15:43:57 GMT
- Title: COAX: Correlation-Aware Indexing on Multidimensional Data with Soft
Functional Dependencies
- Authors: Ali Hadian, Behzad Ghaffari, Taiyi Wang, Thomas Heinis
- Abstract summary: We present COAX, a learned index for multidimensional data that learns the correlations between attributes of the dataset.
We show experimentally that by predicting correlated attributes in the data, we can improve the query execution time and reduce the memory overhead of the index.
- Score: 3.670422696827525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work proposed learned index structures, which learn the distribution
of the underlying dataset to improve performance. The initial work on learned
indexes has shown that by learning the cumulative distribution function of the
data, index structures such as the B-Tree can improve their performance by one
order of magnitude while having a smaller memory footprint.
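The CDF-to-position idea behind these results can be sketched as follows (a minimal illustrative model, not the implementation from the literature): fit a linear model of position against key on the sorted data, record the worst-case prediction error, and binary-search only within that error window.

```python
import bisect
import math

class LearnedIndex:
    """Minimal CDF-style learned index: one linear model plus an error bound."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Closed-form least-squares fit of position i against key value k.
        mean_k = sum(self.keys) / n
        mean_i = (n - 1) / 2
        cov = sum((k - mean_k) * (i - mean_i) for i, k in enumerate(self.keys))
        var = sum((k - mean_k) ** 2 for k in self.keys)
        self.a = cov / var if var else 0.0
        self.b = mean_i - self.a * mean_k
        # The worst-case prediction error makes the narrowed search exact.
        self.err = math.ceil(max(abs(i - self._predict(k))
                                 for i, k in enumerate(self.keys)))

    def _predict(self, key):
        return self.a * key + self.b

    def lookup(self, key):
        pred = round(self._predict(key))
        lo = max(0, pred - self.err)
        hi = min(len(self.keys), pred + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None
```

On near-linear key distributions the error window is tiny, so the final search touches only a handful of slots; real learned indexes replace this single linear model with a hierarchy of models.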
In this paper, we present COAX, a learned index for multidimensional data
that, instead of learning the distribution of keys, learns the correlations
between attributes of the dataset. Our approach is driven by the observation
that in many datasets, values of two (or multiple) attributes are correlated.
COAX exploits these correlations to reduce the dimensionality of the datasets.
More precisely, we learn how to infer one (or multiple) attribute $C_d$ from
the remaining attributes and hence no longer need to index attribute $C_d$.
This reduces the dimensionality and hence makes the index smaller and more
efficient.
We theoretically investigate the effectiveness of the proposed technique
based on the predictability of the functionally dependent (FD) attributes. We further show
experimentally that by predicting correlated attributes in the data, we can
improve the query execution time and reduce the memory overhead of the index.
In our experiments, we reduce the execution time by 25% while reducing the
memory footprint of the index by four orders of magnitude.
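The idea can be illustrated with a toy sketch (my own simplification using a linear model, not the authors' implementation): if attribute y is softly determined by x, we index only x, rewrite predicates on y into widened predicates on x, and filter the fetched tuples exactly.

```python
import bisect

class CorrelationAwareIndex:
    """Index only x; answer y-predicates via a learned soft dependency."""

    def __init__(self, rows):
        # rows: (x, y) pairs with y ~ a*x + b within a bounded error (soft FD).
        self.rows = sorted(rows)  # one-dimensional index on x alone
        n = len(rows)
        mean_x = sum(x for x, _ in rows) / n
        mean_y = sum(y for _, y in rows) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in rows)
        var = sum((x - mean_x) ** 2 for x, _ in rows)
        self.a = cov / var  # assumes a positive, non-degenerate correlation
        self.b = mean_y - self.a * mean_x
        # Max deviation of y from the model bounds how much to widen ranges.
        self.eps = max(abs(y - (self.a * x + self.b)) for x, y in rows)

    def range_query_y(self, y_lo, y_hi):
        # Rewrite the y-predicate into an x-range via the inverse model,
        # widened by eps so no qualifying tuple is missed, then filter.
        x_lo = (y_lo - self.eps - self.b) / self.a
        x_hi = (y_hi + self.eps - self.b) / self.a
        lo = bisect.bisect_left(self.rows, (x_lo, float("-inf")))
        hi = bisect.bisect_right(self.rows, (x_hi, float("inf")))
        return [(x, y) for x, y in self.rows[lo:hi] if y_lo <= y <= y_hi]
```

Because eps is a bound on the dependency's error, the rewritten x-range is guaranteed to contain every qualifying tuple; the tighter the correlation, the smaller the widening and the cheaper the refinement.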
Related papers
- Semi-Parametric Retrieval via Binary Token Index [71.78109794895065]
Semi-parametric Vocabulary Disentangled Retrieval (SVDR) is a novel semi-parametric retrieval framework.
It supports two types of indexes: an embedding-based index for high effectiveness, akin to existing neural retrieval methods; and a binary token index that allows for quick and cost-effective setup, resembling traditional term-based retrieval.
It achieves a 3% higher top-1 retrieval accuracy compared to the dense retriever DPR when using an embedding-based index and a 9% higher top-1 accuracy compared to BM25 when using a binary token index.
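The contrast between the two index types can be sketched minimally (an illustrative toy, not SVDR's actual code): a binary token index is essentially a set of token ids per document, so it can be built in one pass with no training.

```python
def build_binary_index(docs):
    # docs: {doc_id: text}. The "index" is each document's set of token ids;
    # no embeddings are trained, so setup is quick and cheap.
    vocab = {}
    index = {}
    for doc_id, text in docs.items():
        ids = set()
        for tok in text.lower().split():
            ids.add(vocab.setdefault(tok, len(vocab)))
        index[doc_id] = ids
    return vocab, index

def search(query, vocab, index):
    # Score each document by its binary token overlap with the query.
    q = {vocab[t] for t in query.lower().split() if t in vocab}
    return max(index, key=lambda d: len(q & index[d]))
```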
arXiv Detail & Related papers (2024-05-03T08:34:13Z)
- Compact Neural Graphics Primitives with Learned Hash Probing [100.07267906666293]
We show that a hash table with learned probes avoids the drawbacks of both alternatives, resulting in a favorable combination of size and speed.
Inference is faster than unprobed hash tables at equal quality while training is only 1.2-2.6x slower.
arXiv Detail & Related papers (2023-12-28T18:58:45Z)
- OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries [8.950168559003993]
State-of-the-art algorithms for Approximate Nearest Neighbor Search (ANNS) build data dependent indices that offer substantially better accuracy and search efficiency over data-agnostic indices by overfitting to the index data distribution.
We present OOD-DiskANN, which uses a small sample (1% of the index set size) of OOD queries and provides up to 40% improvement in mean query latency over state-of-the-art algorithms of a similar memory footprint.
arXiv Detail & Related papers (2022-10-22T21:22:50Z)
- NFL: Robust Learned Index via Distribution Transformation [14.812854942243503]
This paper tackles the approximation problem by applying a distribution transformation to the keys before constructing the learned index.
A two-stage Normalizing-Flow-based Learned index framework (NFL) is proposed, which first transforms the original complex key distribution into a near-uniform distribution, then builds a learned index leveraging the transformed keys.
Based on the characteristics of the transformed keys, we propose a robust After-Flow Learned Index (AFLI).
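The transformation step can be illustrated with a plain empirical-CDF (quantile) transform, a deliberately simplified stand-in for the paper's normalizing flow:

```python
import bisect

def fit_quantile_transform(keys):
    # Empirical CDF of the key sample: maps any key into (0, 1].
    ks = sorted(keys)
    n = len(ks)
    def transform(k):
        return bisect.bisect_right(ks, k) / n
    return transform

# Heavily skewed keys become uniform under their own empirical CDF,
# so a single linear model pos = n * t(k) indexes them exactly.
keys = [x ** 3 for x in range(1, 101)]
t = fit_quantile_transform(keys)
assert all(round(len(keys) * t(k)) == i + 1 for i, k in enumerate(keys))
```

A normalizing flow plays the same role but is learned, smooth, and invertible, so it also handles keys not present in the training sample.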
arXiv Detail & Related papers (2022-05-24T06:03:19Z)
- A Learned Index for Exact Similarity Search in Metric Spaces [25.330353637669386]
LIMS uses data clustering and pivot-based data transformation techniques to build learned indexes.
Machine learning models are developed to approximate the position of each data record on the disk.
Extensive experiments on real-world and synthetic datasets demonstrate the superiority of LIMS compared with traditional indexes.
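The pivot-based filtering that such metric-space indexes rely on can be sketched as follows (a generic illustration, not LIMS's implementation): precomputed pivot distances plus the triangle inequality prune candidates before any exact distance computation.

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def build_pivot_table(points, pivots):
    # Precompute every point's distance to each pivot (done once, offline).
    return [[dist(x, p) for p in pivots] for x in points]

def range_search(q, r, points, pivots, table):
    qd = [dist(q, p) for p in pivots]
    out = []
    for x, row in zip(points, table):
        # Triangle inequality: |d(q,p) - d(x,p)| <= d(q,x), so if this
        # lower bound exceeds r for any pivot, x cannot be an answer.
        if any(abs(qd[j] - row[j]) > r for j in range(len(pivots))):
            continue
        if dist(q, x) <= r:  # exact distance only for surviving candidates
            out.append(x)
    return out
```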
arXiv Detail & Related papers (2022-04-21T11:24:55Z)
- Micro-architectural Analysis of a Learned Index [0.0]
ALEX is a tree-based in-memory index structure that consists of a hierarchy of machine learned models.
Our results show that ALEX exhibits fewer stalls and a lower cycles-per-instruction value across different workloads.
On the other hand, the number of instructions required to handle out-of-bound inserts in ALEX can increase the instructions needed per request significantly (10X).
arXiv Detail & Related papers (2021-09-17T12:13:06Z)
- How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z)
- The Case for Learned Spatial Indexes [62.88514422115702]
We use techniques from a state-of-the-art learned multi-dimensional index structure (namely, Flood) to answer spatial range queries.
We show that (i) machine learned search within a partition is 11.79% to 39.51% faster than binary search when filtering on one dimension.
We also show that (ii) refinement using machine learned indexes is 1.23x to 1.83x faster than the closest competitor, which filters on two dimensions.
arXiv Detail & Related papers (2020-08-24T12:09:55Z)
- RadixSpline: A Single-Pass Learned Index [84.84747738666263]
We introduce RadixSpline (RS), a learned index that can be built in a single pass over the data.
RS achieves competitive results on all datasets, despite the fact that it only has two parameters.
arXiv Detail & Related papers (2020-04-30T01:56:54Z)
- New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, called RIn-Close_CVC3, keeps the attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.