Related papers: Forecasting SQL Query Cost at Twitter

URL: http://arxiv.org/abs/2204.05529v1
Date: Tue, 12 Apr 2022 05:08:30 GMT
Title: Forecasting SQL Query Cost at Twitter
Authors: Chunxu Tang, Beinan Wang, Zhenxiao Luo, Huijun Wu, Shajan Dasan, Maosong Fu, Yao Li, Mainak Ghosh, Ruchin Kabra, Nikhil Kantibhai Navadiya, Da Cheng, Fred Dai, Vrushali Channapattan, and Prachi Mishra
Abstract summary: Service employs machine learning techniques to train models from historical query request logs. Models can achieve 97.9% accuracy for CPU usage prediction and 97% accuracy for memory usage prediction.
Score: 2.124552987084511
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the advent of the Big Data era, it is usually computationally expensive to calculate the resource usages of a SQL query with traditional DBMS approaches. Can we estimate the cost of each query more efficiently without any computation in a SQL engine kernel? Can machine learning techniques help to estimate SQL query resource utilization? The answers are yes. We propose a SQL query cost predictor service, which employs machine learning techniques to train models from historical query request logs and rapidly forecasts the CPU and memory resource usages of online queries without any computation in a SQL engine. At Twitter, infrastructure engineers are maintaining a large-scale SQL federation system across on-premises and cloud data centers for serving ad-hoc queries. The proposed service can help to improve query scheduling by relieving the issue of imbalanced online analytical processing (OLAP) workloads in the SQL engine clusters. It can also assist in enabling preemptive scaling. Additionally, the proposed approach uses plain SQL statements for the model training and online prediction, indicating it is both hardware and software-agnostic. The method can be generalized to broader SQL systems and heterogeneous environments. The models can achieve 97.9% accuracy for CPU usage prediction and 97% accuracy for memory usage prediction.
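
Illustrative sketch: the abstract describes training on plain SQL statements from historical logs and predicting resource usage without running the query. The Python below is a minimal sketch of that idea under stated assumptions, not the paper's models or features. It assumes usage is discretized into coarse buckets ("low", "high") and substitutes a small multinomial naive Bayes classifier over SQL tokens. The class name NaiveBayesCostModel, the tokenizer, and the toy history are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(sql):
    """Lowercase a SQL statement; keep identifiers/keywords and a few symbols."""
    return re.findall(r"[a-z_][a-z0-9_]*|[(),*=<>]", sql.lower())


class NaiveBayesCostModel:
    """Hypothetical stand-in model: multinomial naive Bayes with add-one smoothing."""

    def fit(self, statements, labels):
        self.token_counts = defaultdict(Counter)  # bucket -> token frequencies
        self.label_counts = Counter(labels)       # bucket -> number of queries
        self.vocab = set()
        for sql, label in zip(statements, labels):
            tokens = tokenize(sql)
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, sql):
        tokens = tokenize(sql)
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n / total)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy "historical log" (invented for illustration): SQL text with a CPU bucket.
history = [
    ("SELECT id FROM users WHERE id = 1", "low"),
    ("SELECT name FROM users LIMIT 10", "low"),
    ("SELECT count(*) FROM events e JOIN users u ON e.uid = u.id "
     "GROUP BY u.country", "high"),
    ("SELECT * FROM events e JOIN clicks c ON e.id = c.eid "
     "JOIN users u ON c.uid = u.id", "high"),
]

model = NaiveBayesCostModel().fit(
    [sql for sql, _ in history], [bucket for _, bucket in history]
)

# Prediction uses only the statement text; no query engine is involved.
print(model.predict("SELECT id FROM users WHERE id = 7"))
print(model.predict("SELECT * FROM events e JOIN users u ON e.uid = u.id"))
```

Because the only input is the statement text, a sketch like this does not depend on a particular engine or hardware, which is the property the abstract highlights. A real service would need far more log data and a separate model (or output) for memory usage.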

Related papers

Improving DBMS Scheduling Decisions with Fine-grained Performance Prediction on Concurrent Queries -- Extended [15.354441937462271]
This work introduces IconqSched, a new, principled non-intrusive scheduler that optimizes the execution order and timing of queries. IconqSched features a novel fine-grained predictor, Iconq, which treats the system runtime as a black box. We compare IconqSched to other schedulers in terms of end-to-end runtime using real workload traces.
arXiv Detail & Related papers (2025-01-27T17:55:39Z)
PixelsDB: Serverless and Natural-Language-Aided Data Analytics with Flexible Service Levels and Prices [16.104672530595483]
PixelsDB is an open-source data analytic system that allows users to explore data efficiently. It allows users to generate and debug SQL queries using a natural language interface powered by fine-tuned language models. The queries are then executed by a serverless query engine that offers varying prices for different service levels on query urgency.
arXiv Detail & Related papers (2024-05-30T07:48:43Z)
FeatAug: Automatic Feature Augmentation From One-to-Many Relationship Tables [4.058220332950672]
Feature augmentation from one-to-many relationship tables is a critical but challenging problem in ML model development. We propose FEATAUG, a new feature augmentation framework that automatically extracts predicate-aware queries from one-to-many relationship tables. Our experiments on four real-world datasets demonstrate that FeatAug extracts more effective features compared to Featuretools.
arXiv Detail & Related papers (2024-03-11T01:44:14Z)
SQLPrompt: In-Context Text-to-SQL with Minimal Labeled Data [54.69489315952524]
SQLPrompt is designed to improve the few-shot prompting capabilities of Text-to-SQL for LLMs. We show that SQLPrompt outperforms previous approaches for in-context learning with few labeled data by a large margin.
arXiv Detail & Related papers (2023-11-06T05:24:06Z)
JoinGym: An Efficient Query Optimization Environment for Reinforcement Learning [58.71541261221863]
Join order selection (JOS) is the problem of ordering join operations to minimize total query execution cost. We present JoinGym, a query optimization environment for reinforcement learning (RL). Under the hood, JoinGym simulates a query plan's cost by looking up intermediate result cardinalities from a pre-computed dataset.
arXiv Detail & Related papers (2023-07-21T17:00:06Z)
SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces SQL-PaLM, a framework for enhancing Text-to-SQL using large language models (LLMs). With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses. With instruction fine-tuning, we delve deep into understanding the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z)
Wav2SQL: Direct Generalizable Speech-To-SQL Parsing [55.10009651476589]
Speech-to-SQL (S2SQL) aims to convert spoken questions into SQL queries given databases. We propose the first direct speech-to-SQL parsing model, Wav2SQL, which avoids error compounding across cascaded systems. Experimental results demonstrate that Wav2SQL avoids error compounding and achieves state-of-the-art results by up to 2.5% accuracy improvement over the baseline.
arXiv Detail & Related papers (2023-05-21T19:26:46Z)
Weakly Supervised Text-to-SQL Parsing through Question Decomposition [53.22128541030441]
We take advantage of the recently proposed question meaning representation called QDMR. Given questions, their QDMR structures (annotated by non-experts or automatically predicted) and the answers, we are able to automatically synthesize SQL queries. Our results show that the weakly supervised models perform competitively with those trained on NL-SQL benchmark data.
arXiv Detail & Related papers (2021-12-12T20:02:42Z)
Learning GraphQL Query Costs (Extended Version) [7.899264246319001]
We propose a machine-learning approach to efficiently and accurately estimate the query cost. Our framework is efficient and predicts query costs with high accuracy, consistently outperforming the static analysis by a large margin.
arXiv Detail & Related papers (2021-08-25T09:18:31Z)
Dual Reader-Parser on Hybrid Textual and Tabular Evidence for Open Domain Question Answering [78.9863753810787]
A large amount of the world's knowledge is stored in structured databases. Query languages can answer questions that require complex reasoning, as well as offering full explainability.
arXiv Detail & Related papers (2021-08-05T22:04:13Z)
Efficient Deep Learning Pipelines for Accurate Cost Estimations Over Large Scale Query Workload [25.52190205651031]
We develop a tree convolution based data science pipeline that accurately predicts resource consumption patterns of query traces. We evaluate our pipeline over 19K Presto OLAP queries from Grab, on a data lake of more than 20PB of data. We demonstrate direct cost savings of up to 13.2x for large batched model training over Microsoft Azure.
arXiv Detail & Related papers (2021-03-23T11:36:10Z)
Approximating Aggregated SQL Queries With LSTM Networks [31.528524004435933]
We present a method for query approximation, also known as approximate query processing (AQP). We use an LSTM network to learn the relationship between queries and their results, and to provide a rapid inference layer for predicting query results. Our method was able to predict up to 120,000 queries in a second, and with a single query latency of no more than 2ms.
arXiv Detail & Related papers (2020-10-25T16:17:58Z)
