COAX: Correlation-Aware Indexing on Multidimensional Data with Soft
Functional Dependencies
- URL: http://arxiv.org/abs/2006.16393v3
- Date: Tue, 2 Feb 2021 15:43:57 GMT
- Title: COAX: Correlation-Aware Indexing on Multidimensional Data with Soft
Functional Dependencies
- Authors: Ali Hadian, Behzad Ghaffari, Taiyi Wang, Thomas Heinis
- Abstract summary: We present COAX, a learned index for multidimensional data that learns the correlations between attributes of the dataset.
We show experimentally that by predicting correlated attributes in the data, we can improve the query execution time and reduce the memory overhead of the index.
- Score: 3.670422696827525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work proposed learned index structures, which learn the distribution
of the underlying dataset to improve performance. The initial work on learned
indexes has shown that by learning the cumulative distribution function of the
data, index structures such as the B-Tree can improve their performance by one
order of magnitude while having a smaller memory footprint.
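The CDF-to-position idea behind these results can be sketched as follows (a minimal illustrative model, not the implementation from the literature): fit a linear model of position against key on the sorted data, record the worst-case prediction error, and binary-search only within that error window.

```python
import bisect
import math

class LearnedIndex:
    """Minimal CDF-style learned index: one linear model plus an error bound."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Closed-form least-squares fit of position i against key value k.
        mean_k = sum(self.keys) / n
        mean_i = (n - 1) / 2
        cov = sum((k - mean_k) * (i - mean_i) for i, k in enumerate(self.keys))
        var = sum((k - mean_k) ** 2 for k in self.keys)
        self.a = cov / var if var else 0.0
        self.b = mean_i - self.a * mean_k
        # The worst-case prediction error makes the narrowed search exact.
        self.err = math.ceil(max(abs(i - self._predict(k))
                                 for i, k in enumerate(self.keys)))

    def _predict(self, key):
        return self.a * key + self.b

    def lookup(self, key):
        pred = round(self._predict(key))
        lo = max(0, pred - self.err)
        hi = min(len(self.keys), pred + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None
```

On near-linear key distributions the error window is tiny, so the final search touches only a handful of slots; real learned indexes replace this single linear model with a hierarchy of models.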
In this paper, we present COAX, a learned index for multidimensional data
that, instead of learning the distribution of keys, learns the correlations
between attributes of the dataset. Our approach is driven by the observation
that in many datasets, values of two (or multiple) attributes are correlated.
COAX exploits these correlations to reduce the dimensionality of the datasets.
More precisely, we learn how to infer one (or multiple) attribute $C_d$ from
the remaining attributes and hence no longer need to index attribute $C_d$.
This reduces the dimensionality and hence makes the index smaller and more
efficient.
We theoretically investigate the effectiveness of the proposed technique
based on the predictability of the functionally dependent (FD) attributes. We further show
experimentally that by predicting correlated attributes in the data, we can
improve the query execution time and reduce the memory overhead of the index.
In our experiments, we reduce the execution time by 25% while reducing the
memory footprint of the index by four orders of magnitude.
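The idea can be illustrated with a toy sketch (my own simplification using a linear model, not the authors' implementation): if attribute y is softly determined by x, we index only x, rewrite predicates on y into widened predicates on x, and filter the fetched tuples exactly.

```python
import bisect

class CorrelationAwareIndex:
    """Index only x; answer y-predicates via a learned soft dependency."""

    def __init__(self, rows):
        # rows: (x, y) pairs with y ~ a*x + b within a bounded error (soft FD).
        self.rows = sorted(rows)  # one-dimensional index on x alone
        n = len(rows)
        mean_x = sum(x for x, _ in rows) / n
        mean_y = sum(y for _, y in rows) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in rows)
        var = sum((x - mean_x) ** 2 for x, _ in rows)
        self.a = cov / var  # assumes a positive, non-degenerate correlation
        self.b = mean_y - self.a * mean_x
        # Max deviation of y from the model bounds how much to widen ranges.
        self.eps = max(abs(y - (self.a * x + self.b)) for x, y in rows)

    def range_query_y(self, y_lo, y_hi):
        # Rewrite the y-predicate into an x-range via the inverse model,
        # widened by eps so no qualifying tuple is missed, then filter.
        x_lo = (y_lo - self.eps - self.b) / self.a
        x_hi = (y_hi + self.eps - self.b) / self.a
        lo = bisect.bisect_left(self.rows, (x_lo, float("-inf")))
        hi = bisect.bisect_right(self.rows, (x_hi, float("inf")))
        return [(x, y) for x, y in self.rows[lo:hi] if y_lo <= y <= y_hi]
```

Because eps is a bound on the dependency's error, the rewritten x-range is guaranteed to contain every qualifying tuple; the tighter the correlation, the smaller the widening and the cheaper the refinement.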
Related papers
- Semi-Parametric Retrieval via Binary Token Index [71.78109794895065]
Semi-parametric Vocabulary Disentangled Retrieval (SVDR) is a novel semi-parametric retrieval framework.
It supports two types of indexes: an embedding-based index for high effectiveness, akin to existing neural retrieval methods; and a binary token index that allows for quick and cost-effective setup, resembling traditional term-based retrieval.
It achieves a 3% higher top-1 retrieval accuracy compared to the dense retriever DPR when using an embedding-based index and a 9% higher top-1 accuracy compared to BM25 when using a binary token index.
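The contrast between the two index types can be sketched minimally (an illustrative toy, not SVDR's actual code): a binary token index is essentially a set of token ids per document, so it can be built in one pass with no training.

```python
def build_binary_index(docs):
    # docs: {doc_id: text}. The "index" is each document's set of token ids;
    # no embeddings are trained, so setup is quick and cheap.
    vocab = {}
    index = {}
    for doc_id, text in docs.items():
        ids = set()
        for tok in text.lower().split():
            ids.add(vocab.setdefault(tok, len(vocab)))
        index[doc_id] = ids
    return vocab, index

def search(query, vocab, index):
    # Score each document by its binary token overlap with the query.
    q = {vocab[t] for t in query.lower().split() if t in vocab}
    return max(index, key=lambda d: len(q & index[d]))
```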
arXiv Detail & Related papers (2024-05-03T08:34:13Z)
- Compact Neural Graphics Primitives with Learned Hash Probing [100.07267906666293]
We show that a hash table with learned probes avoids the drawbacks of both alternatives, resulting in a favorable combination of size and speed.
Inference is faster than unprobed hash tables at equal quality while training is only 1.2-2.6x slower.
arXiv Detail & Related papers (2023-12-28T18:58:45Z)
- OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries [8.950168559003993]
State-of-the-art algorithms for Approximate Nearest Neighbor Search (ANNS) build data dependent indices that offer substantially better accuracy and search efficiency over data-agnostic indices by overfitting to the index data distribution.
We present OOD-DiskANN, which uses a small sample (1% of the index set size) of OOD queries and provides up to 40% improvement in mean query latency over state-of-the-art algorithms of a similar memory footprint.
arXiv Detail & Related papers (2022-10-22T21:22:50Z)
- NFL: Robust Learned Index via Distribution Transformation [14.812854942243503]
This paper tackles the approximation problem by applying a distribution transformation to the keys before constructing the learned index.
A two-stage Normalizing-Flow-based Learned index framework (NFL) is proposed, which first transforms the original complex key distribution into a near-uniform distribution, then builds a learned index leveraging the transformed keys.
Based on the characteristics of the transformed keys, we propose a robust After-Flow Learned Index (AFLI).
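The transformation step can be illustrated with a plain empirical-CDF (quantile) transform, a deliberately simplified stand-in for the paper's normalizing flow:

```python
import bisect

def fit_quantile_transform(keys):
    # Empirical CDF of the key sample: maps any key into (0, 1].
    ks = sorted(keys)
    n = len(ks)
    def transform(k):
        return bisect.bisect_right(ks, k) / n
    return transform

# Heavily skewed keys become uniform under their own empirical CDF,
# so a single linear model pos = n * t(k) indexes them exactly.
keys = [x ** 3 for x in range(1, 101)]
t = fit_quantile_transform(keys)
assert all(round(len(keys) * t(k)) == i + 1 for i, k in enumerate(keys))
```

A normalizing flow plays the same role but is learned, smooth, and invertible, so it also handles keys not present in the training sample.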
arXiv Detail & Related papers (2022-05-24T06:03:19Z)
- A Learned Index for Exact Similarity Search in Metric Spaces [25.330353637669386]
LIMS uses data clustering and pivot-based data transformation techniques to build learned indexes.
Machine learning models are developed to approximate the position of each data record on the disk.
Extensive experiments on real-world and synthetic datasets demonstrate the superiority of LIMS compared with traditional indexes.
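The pivot-based filtering that such metric-space indexes rely on can be sketched as follows (a generic illustration, not LIMS's implementation): precomputed pivot distances plus the triangle inequality prune candidates before any exact distance computation.

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def build_pivot_table(points, pivots):
    # Precompute every point's distance to each pivot (done once, offline).
    return [[dist(x, p) for p in pivots] for x in points]

def range_search(q, r, points, pivots, table):
    qd = [dist(q, p) for p in pivots]
    out = []
    for x, row in zip(points, table):
        # Triangle inequality: |d(q,p) - d(x,p)| <= d(q,x), so if this
        # lower bound exceeds r for any pivot, x cannot be an answer.
        if any(abs(qd[j] - row[j]) > r for j in range(len(pivots))):
            continue
        if dist(q, x) <= r:  # exact distance only for surviving candidates
            out.append(x)
    return out
```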
arXiv Detail & Related papers (2022-04-21T11:24:55Z)
- Micro-architectural Analysis of a Learned Index [0.0]
ALEX is a tree-based in-memory index structure that consists of a hierarchy of machine learned models.
Our results show that ALEX exhibits fewer stalls and a lower cycles-per-instruction value across different workloads.
On the other hand, the number of instructions required to handle out-of-bound inserts in ALEX can increase the instructions needed per request significantly (10X).
arXiv Detail & Related papers (2021-09-17T12:13:06Z)
- How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z)
- The Case for Learned Spatial Indexes [62.88514422115702]
We use techniques from a state-of-the-art learned multi-dimensional index structure (namely, Flood) to answer spatial range queries.
We show that (i) machine learned search within a partition is 11.79% to 39.51% faster than binary search when filtering on one dimension.
We also show that (ii) refinement using machine learned indexes is 1.23x to 1.83x faster than the closest competitor, which filters on two dimensions.
arXiv Detail & Related papers (2020-08-24T12:09:55Z)
- RadixSpline: A Single-Pass Learned Index [84.84747738666263]
We introduce RadixSpline (RS), a learned index that can be built in a single pass over the data.
RS achieves competitive results on all datasets, despite the fact that it only has two parameters.
arXiv Detail & Related papers (2020-04-30T01:56:54Z)
- New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, called RIn-Close_CVC3, keeps the attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.