RadixSpline: A Single-Pass Learned Index
- URL: http://arxiv.org/abs/2004.14541v2
- Date: Fri, 22 May 2020 21:01:04 GMT
- Title: RadixSpline: A Single-Pass Learned Index
- Authors: Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons
Kemper, Tim Kraska, Thomas Neumann
- Abstract summary: We introduce RadixSpline (RS), a learned index that can be built in a single pass over the data.
RS achieves competitive results on all datasets, despite the fact that it only has two parameters.
- Score: 84.84747738666263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has shown that learned models can outperform state-of-the-art
index structures in size and lookup performance. While this is a very promising
result, existing learned structures are often cumbersome to implement and are
slow to build. In fact, most approaches that we are aware of require multiple
training passes over the data.
We introduce RadixSpline (RS), a learned index that can be built in a single
pass over the data and is competitive with state-of-the-art learned index
models, like RMI, in size and lookup performance. We evaluate RS using the SOSD
benchmark and show that it achieves competitive results on all datasets,
despite the fact that it only has two parameters.
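To make the two-parameter design concrete, below is a minimal sketch of a RadixSpline-style index in Python. It is an illustration under simplifying assumptions, not the authors' implementation: the paper fits an error-bounded spline with the GreedySplineCorridor algorithm, whereas this sketch uses a simpler shrinking-cone segment fitter with the same single-pass, bounded-error flavor. All names (`RadixSplineSketch`, `max_error`, `radix_bits`) are illustrative, and sorted, distinct, unsigned 64-bit integer keys are assumed.

```python
import bisect

class RadixSplineSketch:
    """Single-pass build: fit error-bounded linear segments over the
    key -> position mapping (the CDF), then index the segment array
    with a radix table over the top `radix_bits` bits of each key."""

    def __init__(self, keys, max_error=32, radix_bits=8):
        # assume sorted, distinct, unsigned 64-bit keys
        self.keys, self.eps = keys, max_error
        self.bases, self.positions, self.slopes = [], [], []

        # One pass: keep a slope cone [sl, sh] of lines through the
        # segment base that hit every seen point within +/- max_error;
        # when a point empties the cone, close the segment and restart.
        x0, y0, sl, sh = keys[0], 0, float("-inf"), float("inf")
        for i in range(1, len(keys)):
            dx = keys[i] - x0
            lo = (i - max_error - y0) / dx
            hi = (i + max_error - y0) / dx
            if lo > sh or hi < sl:            # cone empty: emit segment
                self._emit(x0, y0, sl, sh)
                x0, y0, sl, sh = keys[i], i, float("-inf"), float("inf")
            else:                             # shrink the cone
                sl, sh = max(sl, lo), min(sh, hi)
        self._emit(x0, y0, sl, sh)

        # Radix table: table[p] = index of the first segment whose base
        # key has a radix prefix >= p; table[p]..table[p+1] then bounds
        # the segment search for any key with prefix p.
        self.shift = 64 - radix_bits
        self.table, idx = [], 0
        for p in range(2 ** radix_bits + 1):
            while idx < len(self.bases) and (self.bases[idx] >> self.shift) < p:
                idx += 1
            self.table.append(idx)

    def _emit(self, x0, y0, sl, sh):
        self.bases.append(x0)
        self.positions.append(y0)
        # any slope inside the cone keeps all covered points within eps
        self.slopes.append(0.0 if sh == float("inf") else (sl + sh) / 2)

    def lookup(self, key):
        """Return the position of `key`, or None if absent."""
        p = key >> self.shift
        g = bisect.bisect_right(self.bases, key,
                                self.table[p], self.table[p + 1]) - 1
        if g < 0:
            return None                       # key below the smallest key
        pred = self.positions[g] + self.slopes[g] * (key - self.bases[g])
        lo = max(0, int(pred) - self.eps)     # error-bounded final search
        hi = min(len(self.keys), int(pred) + self.eps + 2)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None
```

The two constructor arguments mirror the paper's two tuning knobs: the spline error bound caps the final search window, while the number of radix bits trades table size against a tighter segment search range. For example, `RadixSplineSketch(keys).lookup(keys[123])` returns `123`.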
Related papers
- Can LLMs Separate Instructions From Data? And What Do We Even Mean By That? [60.50127555651554]
Large Language Models (LLMs) show impressive results in numerous practical applications, but they lack essential safety features.
This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks.
We introduce a formal measure for instruction-data separation and an empirical variant that is calculable from a model's outputs.
arXiv Detail & Related papers (2024-03-11T15:48:56Z)
- SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval [92.27387459751309]
We provide SPRINT, a unified Python toolkit for evaluating neural sparse retrieval.
We establish strong and reproducible zero-shot sparse retrieval baselines on the widely used BEIR benchmark.
We show that SPLADEv2 produces sparse representations with a majority of tokens outside of the original query and document.
arXiv Detail & Related papers (2023-07-19T22:48:02Z)
- Learning Phone Recognition from Unpaired Audio and Phone Sequences Based on Generative Adversarial Network [58.82343017711883]
This paper investigates how to learn directly from unpaired phone sequences and speech utterances.
GAN training is adopted in the first stage to learn the mapping between unpaired speech and phone sequences.
In the second stage, an HMM is trained on the generator's output, which boosts performance.
arXiv Detail & Related papers (2022-07-29T09:29:28Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that does not force any structure on the search space: using all n-grams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval [11.38022203865326]
The SPLADE model provides highly sparse representations and competitive results with respect to state-of-the-art dense and sparse approaches.
We modify the pooling mechanism, benchmark a model solely based on document expansion, and introduce models trained with distillation.
Overall, SPLADE is considerably improved, with gains of more than 9% on NDCG@10 on TREC DL 2019, leading to state-of-the-art results on the BEIR benchmark.
arXiv Detail & Related papers (2021-09-21T10:43:42Z)
- The Price of Tailoring the Index to Your Data: Poisoning Attacks on Learned Index Structures [9.567119607658299]
We present the first study of poisoning attacks on learned index structures.
We formulate the first poisoning attacks on linear regression models trained on a cumulative distribution function.
We generalize our poisoning techniques to attack a more advanced two-stage design of learned index structures.
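(A toy illustration of this vulnerability appears after the list below.)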
arXiv Detail & Related papers (2020-08-01T17:12:04Z)
- COAX: Correlation-Aware Indexing on Multidimensional Data with Soft Functional Dependencies [3.670422696827525]
We present COAX, a learned index for multidimensional data that learns the correlations between attributes of the dataset.
We show experimentally that by predicting correlated attributes in the data, we can improve the query execution time and reduce the memory overhead of the index.
arXiv Detail & Related papers (2020-06-29T21:22:15Z)
- A Close Look at Deep Learning with Small Data [0.0]
We show that model complexity is a critical factor when only a few samples per class are available.
We also show that even standard data augmentation can boost recognition performance by large margins.
arXiv Detail & Related papers (2020-03-28T17:11:29Z)
- Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim.
We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting.
Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)
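As promised in the poisoning-attack entry above, here is a toy illustration of why a learned index built on a linear CDF model is poisonable: a small cluster of well-placed keys skews the fitted line, inflating the maximum prediction error for legitimate keys. This is a sketch on made-up data with a plain least-squares fit, not the paper's attack algorithm; all names and numbers are illustrative.

```python
import random

def fit_line(keys):
    # least-squares fit of rank ~ a * key + b over sorted keys,
    # i.e. the single linear CDF model that such attacks target
    n = len(keys)
    mx = sum(keys) / n
    my = (n - 1) / 2                      # mean of ranks 0..n-1
    cov = sum((k - mx) * (i - my) for i, k in enumerate(keys))
    var = sum((k - mx) ** 2 for k in keys)
    a = cov / var
    return a, my - a * mx

def max_abs_error(keys):
    a, b = fit_line(keys)
    return max(abs(a * k + b - i) for i, k in enumerate(keys))

random.seed(0)
legit = sorted(random.sample(range(1_000_000), 1_000))
# hypothetical adversary: append a tight cluster of extreme keys that
# drags the regression line away from the legitimate keys' CDF
poisoned = sorted(legit + list(range(2_000_000, 2_000_050)))

print(max_abs_error(legit))     # small: the line tracks a uniform CDF well
print(max_abs_error(poisoned))  # much larger: search windows must grow
```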
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.