PRICE: A Pretrained Model for Cross-Database Cardinality Estimation
- URL: http://arxiv.org/abs/2406.01027v1
- Date: Mon, 3 Jun 2024 06:21:53 GMT
- Title: PRICE: A Pretrained Model for Cross-Database Cardinality Estimation
- Authors: Tianjing Zeng, Junwei Lan, Jiahong Ma, Wenqing Wei, Rong Zhu, Pengfei Li, Bolin Ding, Defu Lian, Zhewei Wei, Jingren Zhou
- Abstract summary: Cardinality estimation (CardEst) is essential for optimizing query execution plans.
Recent ML-based CardEst methods achieve high accuracy but face deployment challenges due to high preparation costs.
We propose PRICE, a PRetrained multI-table CardEst model, which addresses these limitations.
- Score: 78.30959470441442
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cardinality estimation (CardEst) is essential for optimizing query execution plans. Recent ML-based CardEst methods achieve high accuracy but face deployment challenges due to high preparation costs and lack of transferability across databases. In this paper, we propose PRICE, a PRetrained multI-table CardEst model, which addresses these limitations. PRICE takes low-level but transferable features w.r.t. data distributions and query information and elegantly applies self-attention models to learn meta-knowledge to compute cardinality in any database. It is generally applicable to any unseen new database to attain high estimation accuracy, while its preparation cost is as low as that of basic one-dimensional histogram-based CardEst methods. Moreover, PRICE can be finetuned to further enhance its performance on any specific database. We pretrained PRICE using 30 diverse datasets, completing the process in about 5 hours with a resulting model size of only about 40MB. Evaluations show that PRICE consistently outperforms existing methods, achieving the highest estimation accuracy on several unseen databases and generating faster execution plans with lower overhead. After finetuning with a small volume of database-specific queries, PRICE could even find plans very close to the optimal ones. Meanwhile, PRICE is generally applicable to different settings such as data updates, data scaling, and query workload shifts. We have made all of our data and code publicly available at https://github.com/StCarmen/PRICE.
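As a rough illustration of the approach the abstract describes (schema-agnostic feature tokens fed to a self-attention model that regresses cardinality), the following minimal PyTorch sketch shows the general shape of such a model. The feature layout, dimensions, and pooling are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not PRICE itself): a self-attention model that consumes
# low-level, transferable feature tokens -- e.g. per-table histogram buckets,
# per-predicate selectivity hints, join-graph flags -- and regresses
# log-cardinality. Feature names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CardEstTransformer(nn.Module):
    def __init__(self, feat_dim: int = 24, d_model: int = 128, n_heads: int = 4, n_layers: int = 3):
        super().__init__()
        self.embed = nn.Linear(feat_dim, d_model)      # project raw features to model width
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)              # predict log(cardinality)

    def forward(self, tokens: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_tokens, feat_dim), one token per table/predicate/join edge
        # pad_mask: (batch, n_tokens), True where the token is padding
        x = self.encoder(self.embed(tokens), src_key_padding_mask=pad_mask)
        x = x.masked_fill(pad_mask.unsqueeze(-1), 0.0)
        pooled = x.sum(dim=1) / (~pad_mask).sum(dim=1, keepdim=True).clamp(min=1)
        return self.head(pooled).squeeze(-1)           # log-cardinality estimate

# Usage: a batch of 2 queries, up to 6 feature tokens each.
model = CardEstTransformer()
tokens = torch.randn(2, 6, 24)
pad_mask = torch.zeros(2, 6, dtype=torch.bool)
pad_mask[1, 4:] = True                                 # second query has only 4 tokens
log_card = model(tokens, pad_mask)
print(log_card.shape)                                  # torch.Size([2])
```

One appeal of this shape is that the token set grows with the query's tables and predicates rather than being tied to one schema, so the same weights can, in principle, be applied to an unseen database.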
Related papers
- CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases [17.46316633654637]
Cardinality estimation is crucial for enabling high query performance in databases.
There is no systematic benchmark or set of datasets that allows researchers to evaluate the progress made by new learned approaches.
We release a benchmark, containing thousands of queries over 20 distinct real-world databases for learned cardinality estimation.
arXiv Detail & Related papers (2024-08-28T23:25:25Z)
- Kepler: Robust Learning for Faster Parametric Query Optimization [5.6119420695093245]
We propose an end-to-end learning-based approach to parametric query optimization.
Kepler achieves significant improvements in query runtime on multiple datasets.
arXiv Detail & Related papers (2023-06-11T22:39:28Z)
- Lero: A Learning-to-Rank Query Optimizer [49.841082217997354]
We introduce a learning-to-rank query optimizer, called Lero, which builds on top of the native query optimizer and continuously learns to improve query optimization.
Rather than building a learned optimizer from scratch, Lero is designed to leverage decades of wisdom of database research and improve the native query optimizer.
Lero achieves near optimal performance on several benchmarks.
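For context on what "learning to rank" means for query plans, here is a minimal, generic pairwise-ranking sketch in PyTorch; it is not Lero's actual model or feature encoding, only the standard pairwise setup that such optimizers build on.

```python
# Generic pairwise learning-to-rank sketch (not Lero's implementation): instead of
# predicting absolute plan costs, a small network scores plan feature vectors and is
# trained so that the better plan of each observed pair receives the higher score.
# Feature encoding and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class PlanScorer(nn.Module):
    def __init__(self, feat_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, plan_feats: torch.Tensor) -> torch.Tensor:
        return self.net(plan_feats).squeeze(-1)   # higher score = preferred plan

def pairwise_loss(scorer: PlanScorer, better: torch.Tensor, worse: torch.Tensor) -> torch.Tensor:
    # Binary cross-entropy on the score difference: push score(better) above score(worse).
    diff = scorer(better) - scorer(worse)
    return nn.functional.binary_cross_entropy_with_logits(diff, torch.ones_like(diff))

# Toy usage with random "plan pairs" where the first plan of each pair ran faster.
scorer = PlanScorer()
better_plans, worse_plans = torch.randn(16, 32), torch.randn(16, 32)
loss = pairwise_loss(scorer, better_plans, worse_plans)
loss.backward()   # gradients flow into the scorer, as in standard pairwise LTR training
```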
arXiv Detail & Related papers (2023-02-14T07:31:11Z)
- Glue: Adaptively Merging Single Table Cardinality to Estimate Join Query Size [35.1093718746362]
Cardinality estimation (CardEst) plays a significant role in generating high-quality query plans.
The hardest problem in CardEst, i.e., how to estimate the join query size over multiple tables, remains largely unsolved.
In this paper, we propose a very general framework, called Glue, which adaptively merges single table-wise CardEst results to estimate join query sizes.
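As a point of reference for why merging single-table estimates into a join size is hard, the classic textbook baseline combines per-table selectivities under uniformity and independence assumptions. The sketch below shows that baseline with made-up numbers; it is not Glue's algorithm.

```python
# Not Glue's actual algorithm: the textbook System R-style baseline that merges
# single-table CardEst results into a join-size estimate using the uniformity and
# independence assumptions -- exactly where large errors tend to come from.
def estimate_join_size(card_r: float, card_s: float,
                       distinct_r_key: float, distinct_s_key: float,
                       sel_r: float = 1.0, sel_s: float = 1.0) -> float:
    """|sigma(R) join sigma(S)| ~= (|R|*sel_R * |S|*sel_S) / max(V(R,k), V(S,k))."""
    return (card_r * sel_r) * (card_s * sel_s) / max(distinct_r_key, distinct_s_key)

# Hypothetical numbers: R has 1M rows, S has 100K rows, the join key has
# 50K / 40K distinct values, and local predicates keep 10% / 50% of the rows.
print(estimate_join_size(1_000_000, 100_000, 50_000, 40_000, sel_r=0.1, sel_s=0.5))
# 100000.0
```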
arXiv Detail & Related papers (2021-12-07T02:46:46Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing pre-trained models of about half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- Cardinality Estimation in DBMS: A Comprehensive Benchmark Evaluation [43.27881697012329]
Cardinality estimation (CardEst) plays a significant role in generating high-quality query plans for a query workload.
In this paper, we comprehensively and systematically compare the effectiveness of CardEst methods in a real DBMS.
We establish a new benchmark for CardEst, which contains a new complex real-world dataset STATS and a diverse query workload STATS-CEB.
arXiv Detail & Related papers (2021-09-13T11:25:02Z)
- How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z)
- Robust Generalization and Safe Query-Specialization in Counterfactual Learning to Rank [62.28965622396868]
We introduce the Generalization and Specialization (GENSPEC) algorithm, a robust feature-based counterfactual Learning to Rank method.
Our results show that GENSPEC leads to optimal performance on queries with sufficient click data, while having robust behavior on queries with little or noisy data.
arXiv Detail & Related papers (2021-02-11T13:17:26Z)
- A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration [17.75042918159419]
A cost-based optimization algorithm is adopted in almost all current database systems.
In the cost model, cardinality, i.e., the number of tuples flowing through an operator, plays a crucial role.
Due to the inaccuracy of cardinality estimation, errors in the cost model, and the huge plan space, the algorithm cannot find the optimal execution plan for a complex query in a reasonable time.
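To make the role of cardinality in the cost model concrete, the sketch below computes a simplified C_out-style cost from the CardEst literature, i.e., the sum of estimated output cardinalities over a plan's operators; the plan shape and estimates are hypothetical, not taken from the survey.

```python
# Illustrative only: a simplified C_out-style cost that scores a join plan by the
# sum of estimated output cardinalities of its operators, so a bad cardinality
# estimate for any subtree directly inflates or deflates the plan's cost.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PlanNode:
    name: str
    est_card: float                      # estimated output cardinality of this operator
    left: Optional["PlanNode"] = None
    right: Optional["PlanNode"] = None

def c_out(node: PlanNode) -> float:
    """Sum of estimated output cardinalities over all operators in the plan."""
    total = node.est_card
    for child in (node.left, node.right):
        if child is not None:
            total += c_out(child)
    return total

# (A join B) join C, with hypothetical estimates.
plan = PlanNode("join2", est_card=1_200,
                left=PlanNode("join1", est_card=50_000,
                              left=PlanNode("scan_A", 10_000),
                              right=PlanNode("scan_B", 200_000)),
                right=PlanNode("scan_C", 5_000))
print(c_out(plan))   # 266200 -> dominated by the (possibly misestimated) join1 subtree
```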
arXiv Detail & Related papers (2021-01-05T13:47:45Z)