Bootstrapping Learned Cost Models with Synthetic SQL Queries
- URL: http://arxiv.org/abs/2508.19807v1
- Date: Wed, 27 Aug 2025 11:50:42 GMT
- Title: Bootstrapping Learned Cost Models with Synthetic SQL Queries
- Authors: Michael Nidd, Christoph Miksovic, Thomas Gschwind, Francesco Fusco, Andrea Giovannini, Ioana Giurgiu,
- Abstract summary: Recent advances in learned cost models have shown that one can effectively and efficiently predict the cost of running a given query against a specific database engine.<n>We show that we can improve a learned cost model's predictive accuracy by training it with 45% fewer queries than when using competitive generation approaches.
- Score: 5.358221160521712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Having access to realistic workloads for a given database instance is extremely important to enable stress and vulnerability testing, as well as to optimize for cost and performance. Recent advances in learned cost models have shown that when enough diverse SQL queries are available, one can effectively and efficiently predict the cost of running a given query against a specific database engine. In this paper, we describe our experience in exploiting modern synthetic data generation techniques, inspired by the generative AI and LLM community, to create high-quality datasets enabling the effective training of such learned cost models. Initial results show that we can improve a learned cost model's predictive accuracy by training it with 45% fewer queries than when using competitive generation approaches.
Related papers
- Making Databases Faster with LLM Evolutionary Sampling [27.62392938968789]
Traditional query optimization relies on cost-based models that estimate execution cost.<n>We use our DBPlan harness for the DataFusion engine to propose localized edits that can be applied and executed.<n>We then apply an evolutionary search over these edits to refine candidates across iterations.<n>We obtain up to 4.78$times$ speedups on some queries and we demonstrate a small-to-large workflow.
arXiv Detail & Related papers (2026-02-11T00:21:51Z) - Thinking Augmented Pre-training [88.04395622064708]
Thinking augmented Pre-Training is a universal methodology that augments text with automatically generated thinking trajectories.<n>This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories.
arXiv Detail & Related papers (2025-09-24T14:45:13Z) - SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL)<n>We propose textbfSPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z) - FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs [3.703188184729035]
Synthetic data generation is an invaluable solution in scenarios where real-world data collection and usage are limited by cost and scarcity.<n>Existing approaches that directly use large language models to generate each record individually impose prohibitive time and cost burdens.<n>We propose a fast, cost-effective method for realistic tabular data synthesis that leverages LLMs to infer and encode each field's distribution into a reusable sampling script.
arXiv Detail & Related papers (2025-07-21T17:51:46Z) - Reqo: A Robust and Explainable Query Optimization Cost Model [2.184775414778289]
We propose a tree model architecture based on Bidirectional Graph Neural Networks (Bi-GNN) aggregated by Gated Recurrent Units (GRUs)<n>We implement a novel learning-to-rank cost model that effectively quantifies the uncertainty in cost estimates using approximate probabilistic ML.<n>In addition, we propose the first explainability technique specifically designed for learning-based cost models.
arXiv Detail & Related papers (2025-01-29T04:48:51Z) - Self-Refinement Strategies for LLM-based Product Attribute Value Extraction [51.45146101802871]
This paper investigates applying two self-refinement techniques to the product attribute value extraction task.<n>The experiments show that both self-refinement techniques fail to significantly improve the extraction performance while substantially increasing processing costs.<n>For scenarios with development data, fine-tuning yields the highest performance, while the ramp-up costs of fine-tuning are balanced out as the amount of product descriptions increases.
arXiv Detail & Related papers (2025-01-02T12:55:27Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and
Training Efficiency via Efficient Data Sampling and Routing [57.86954315102865]
DeepSpeed Data Efficiency is a framework that makes better use of data, increases training efficiency, and improves model quality.
For GPT-3 1.3B language model pretraining, our work achieves 12.5x less data/time/cost, while still maintaining 95% of model quality compared to baseline with full data and cost.
For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or achieve better model quality under same data/time/cost.
arXiv Detail & Related papers (2022-12-07T12:27:28Z) - HyperImpute: Generalized Iterative Imputation with Automatic Model
Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z) - SHiFT: An Efficient, Flexible Search Engine for Transfer Learning [16.289623977712086]
Transfer learning can be seen as a data- and compute-efficient alternative to training models from scratch.
We propose SHiFT, the first downstream task-aware, flexible, and efficient model search engine for transfer learning.
arXiv Detail & Related papers (2022-04-04T13:16:46Z) - Zero-Shot Cost Models for Out-of-the-box Learned Cost Prediction [18.46293613612346]
We introduce zero-shot cost models which enable learned cost estimation that generalizes to unseen databases.
We suggest a new learning paradigm based on pre-trained cost models.
We show that zero-shot cost models can be used in a few-shot mode that further improves their quality by retraining them just with a small number of additional training queries on the unseen database.
arXiv Detail & Related papers (2022-01-03T10:11:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.