A Learned Cost Model-based Cross-engine Optimizer for SQL Workloads
- URL: http://arxiv.org/abs/2506.02802v1
- Date: Tue, 03 Jun 2025 12:32:56 GMT
- Title: A Learned Cost Model-based Cross-engine Optimizer for SQL Workloads
- Authors: AndrĂ¡s Strausz, Niels Pardon, Ioana Giurgiu,
- Abstract summary: Lakehouse systems enable the same data to be queried with multiple execution engines.<n>We propose a cross-engine that can automate engine selection for diverse queries through a learned cost model.<n>We show that using a query optimized logical plan for cost estimation decreases the average Q-error by even 12.6% over using unoptimized plans as input.
- Score: 3.7960472831772765
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Lakehouse systems enable the same data to be queried with multiple execution engines. However, selecting the engine best suited to run a SQL query still requires a priori knowledge of the query computational requirements and an engine capability, a complex and manual task that only becomes more difficult with the emergence of new engines and workloads. In this paper, we address this limitation by proposing a cross-engine optimizer that can automate engine selection for diverse SQL queries through a learned cost model. Optimized with hints, a query plan is used for query cost prediction and routing. Cost prediction is formulated as a multi-task learning problem, and multiple predictor heads, corresponding to different engines and provisionings, are used in the model architecture. This eliminates the need to train engine-specific models and allows the flexible addition of new engines at a minimal fine-tuning cost. Results on various databases and engines show that using a query optimized logical plan for cost estimation decreases the average Q-error by even 12.6% over using unoptimized plans as input. Moreover, the proposed cross-engine optimizer reduces the total workload runtime by up to 25.2% in a zero-shot setting and 30.4% in a few-shot setting when compared to random routing.
Related papers
- DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding [76.3876070043663]
We propose DriveLMM-o1, a dataset and benchmark designed to advance step-wise visual reasoning for autonomous driving.<n>Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning.<n>Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model.
arXiv Detail & Related papers (2025-03-13T17:59:01Z) - Improving DBMS Scheduling Decisions with Fine-grained Performance Prediction on Concurrent Queries -- Extended [15.354441937462271]
This work introduces IconqSched, a new, principled non-intrusive scheduler that optimize execution order and timing of queries.<n>IconqSched features a novel fine-grained predictor, Iconq, which treats the system runtime as a black box.<n>We compare IconqSched to other schedulers in terms of end-to-end runtime using real workload traces.
arXiv Detail & Related papers (2025-01-27T17:55:39Z) - Budget-aware Query Tuning: An AutoML Perspective [14.561951257365953]
Modern database systems rely on cost-based querys to come up with good execution plans for input queries.
We show that by varying the costunit values one can obtain query plans that significantly outperform the default query plans.
arXiv Detail & Related papers (2024-03-29T20:19:36Z) - JoinGym: An Efficient Query Optimization Environment for Reinforcement
Learning [58.71541261221863]
Join order selection (JOS) is the problem of ordering join operations to minimize total query execution cost.
We present JoinGym, a query optimization environment for bushy reinforcement learning (RL)
Under the hood, JoinGym simulates a query plan's cost by looking up intermediate result cardinalities from a pre-computed dataset.
arXiv Detail & Related papers (2023-07-21T17:00:06Z) - Single-Stage Visual Relationship Learning using Conditional Queries [60.90880759475021]
TraCQ is a new formulation for scene graph generation that avoids the multi-task learning problem and the entity pair distribution.
We employ a DETR-based encoder-decoder conditional queries to significantly reduce the entity label space as well.
Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods, it also beats many state-of-the-art two-stage methods on the Visual Genome dataset.
arXiv Detail & Related papers (2023-06-09T06:02:01Z) - BitE : Accelerating Learned Query Optimization in a Mixed-Workload
Environment [0.36700088931938835]
BitE is a novel ensemble learning model using database statistics and metadata to tune a learned query for enhancing performance.
Our model achieves 19.6% more improved queries and 15.8% less regressed queries compared to the existing traditional methods.
arXiv Detail & Related papers (2023-06-01T16:05:33Z) - Cheaply Evaluating Inference Efficiency Metrics for Autoregressive
Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z) - AutoTransfer: AutoML with Knowledge Transfer -- An Application to Graph
Neural Networks [75.11008617118908]
AutoML techniques consider each task independently from scratch, leading to high computational cost.
Here we propose AutoTransfer, an AutoML solution that improves search efficiency by transferring the prior architectural design knowledge to the novel task of interest.
arXiv Detail & Related papers (2023-03-14T07:23:16Z) - Lero: A Learning-to-Rank Query Optimizer [49.841082217997354]
We introduce a learning to rank query, called Lero, which builds on top of the native query and continuously learns to improve query optimization.
Rather than building a learned from scratch, Lero is designed to leverage decades of wisdom of databases and improve the native.
Lero achieves near optimal performance on several benchmarks.
arXiv Detail & Related papers (2023-02-14T07:31:11Z) - Forecasting SQL Query Cost at Twitter [2.124552987084511]
Service employs machine learning techniques to train models from historical query request logs.
Models can achieve 97.9% accuracy for CPU usage prediction and 97% accuracy for memory usage prediction.
arXiv Detail & Related papers (2022-04-12T05:08:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.