Efficient and Accurate In-Database Machine Learning with SQL Code Generation in Python
- URL: http://arxiv.org/abs/2104.03224v1
- Date: Wed, 7 Apr 2021 16:23:19 GMT
- Title: Efficient and Accurate In-Database Machine Learning with SQL Code Generation in Python
- Authors: Michael Kaufmann, Gabriel Stechschulte, Anna Huber
- Abstract summary: We describe a novel method for In-Database Machine Learning (IDBML) in Python using template macros in Jinja2.
Our method was 2-3% less accurate than the best current state-of-the-art methods we found (decision trees and random forests) and 2-3 times slower for one in-memory dataset.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Following an analysis of the advantages of SQL-based Machine Learning (ML)
and a short literature survey of the field, we describe a novel method for
In-Database Machine Learning (IDBML). We contribute a process for SQL-code
generation in Python using template macros in Jinja2 as well as the prototype
implementation of the process. We describe our implementation of the process to
compute multidimensional histogram (MDH) probability estimation in SQL. For
this, we contribute and implement a novel discretization method called equal
quantized rank (EQR) variable-width binning. Based on this, we provide data
gathered in a benchmarking experiment for the quantitative empirical evaluation
of our method and system using the Covertype dataset. We measured accuracy and
computation time. Our multidimensional probability estimation was significantly
more accurate than Naive Bayes, which assumes independent one-dimensional
probabilities and/or densities. Also, our method was significantly more
accurate and faster than logistic regression. However, our method was 2-3% less
accurate than the best current state-of-the-art methods we found (decision
trees and random forests) and 2-3 times slower for one in-memory dataset. Yet,
this motivates further research into accuracy improvement and into IDBML with
SQL code generation for big data and larger-than-memory datasets.
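The pipeline the abstract describes, discretizing a column with EQR binning and then generating the SQL that applies the bins, can be sketched as follows. This is an illustrative reading, not the paper's implementation: EQR is interpreted here as equal-count quantile cuts with variable widths, the query shape is assumed, `str.format` stands in for the Jinja2 macros, and the `elevation`/`covertype` names are placeholders.

```python
def eqr_bin_edges(values, n_bins):
    """One plausible reading of equal quantized rank (EQR) binning:
    rank the values and cut at equal-count quantile ranks, giving
    variable-width bins that each hold roughly the same number of
    rows.  (The paper's exact EQR definition may differ.)"""
    s = sorted(values)
    # interior edges at the k/n_bins quantile ranks
    edges = [s[(len(s) * k) // n_bins] for k in range(1, n_bins)]
    # heavy ties can produce duplicate edges; keep each edge once
    return sorted(set(edges))

SQL_CASE_TEMPLATE = """\
SELECT {col},
       CASE {whens}
            ELSE {n} END AS {col}_bin
FROM {table}"""

def render_binning_sql(col, table, edges):
    """Render a SQL CASE expression mapping a column to its bin
    index.  str.format stands in for the paper's Jinja2 template
    macros; the query is a sketch, not the paper's generated SQL."""
    whens = " ".join(
        f"WHEN {col} < {e} THEN {i}" for i, e in enumerate(edges)
    )
    return SQL_CASE_TEMPLATE.format(
        col=col, table=table, whens=whens, n=len(edges)
    )
```

For example, `render_binning_sql("elevation", "covertype", eqr_bin_edges(data, 4))` yields a single query that discretizes the column inside the database, which is the point of pushing the ML preprocessing into SQL.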
Related papers
- In-Database Data Imputation [0.6157028677798809]
Missing data is a widespread problem in many domains, creating challenges in data analysis and decision making.
Traditional techniques for dealing with missing data, such as excluding incomplete records or imputing simple estimates, are computationally efficient but may introduce bias and disrupt variable relationships.
Model-based imputation techniques offer a more robust solution that preserves the variability and relationships in the data, but they demand significantly more computation time.
This work enables efficient, high-quality, and scalable data imputation within a database system using the widely used MICE method.
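The trade-off this entry describes, model-based imputation that preserves variable relationships at extra compute cost, reduces to the following toy sketch. It is a hypothetical single-column illustration of the MICE idea, not the cited in-database implementation:

```python
def fit_line(xs, ys):
    # ordinary least squares for y = a * x + b
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def impute_column(x, y):
    """Model-based imputation for one incomplete column: predict
    each missing y (None) from a regression on a complete column x,
    so the imputed values respect the x-y relationship instead of
    collapsing to a constant mean.  Full MICE cycles such
    regressions over every incomplete column until the imputations
    stabilize, and draws several imputed datasets to reflect
    uncertainty."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b = fit_line([xi for xi, _ in obs], [yi for _, yi in obs])
    return [yi if yi is not None else a * xi + b
            for xi, yi in zip(x, y)]
```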
arXiv Detail & Related papers (2024-01-07T01:57:41Z)
- Optimal Data Generation in Multi-Dimensional Parameter Spaces, using Bayesian Optimization [0.0]
We propose a novel approach for constructing a minimal yet highly informative database for training machine learning models.
We mimic the underlying relation between the output and input parameters using Gaussian process regression (GPR).
Given the predicted standard deviation by GPR, we select data points using Bayesian optimization to obtain an efficient database for training ML models.
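The select-where-uncertain loop in this entry can be sketched with a minimal zero-mean GP. The RBF kernel, unit length scale, and pure-exploration acquisition rule below are illustrative assumptions, not the cited paper's exact setup:

```python
import numpy as np

def rbf(a, b, ls=1.0):
    # squared-exponential kernel between two 1-D point sets
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gpr_std(x_train, x_cand, ls=1.0, noise=1e-6):
    """Posterior predictive standard deviation of a zero-mean GP:
    var(x*) = k(x*, x*) - k*^T (K + noise I)^{-1} k*."""
    K = rbf(x_train, x_train, ls) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_cand, ls)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return np.sqrt(np.maximum(var, 0.0))

def pick_next(x_train, x_cand, ls=1.0):
    """Pure-exploration acquisition: propose the candidate where
    the GP is least certain, i.e. where a new training point adds
    the most information to the database."""
    return x_cand[int(np.argmax(gpr_std(x_train, x_cand, ls)))]
```

Uncertainty grows with distance from the training points, so the loop naturally fills the least-covered regions of the parameter space first.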
arXiv Detail & Related papers (2023-12-04T16:36:29Z)
- A Semiparametric Efficient Approach To Label Shift Estimation and Quantification [0.0]
We present a new procedure called SELSE which estimates the shift in the response variable's distribution.
We prove that SELSE's normalized error has the smallest possible variance matrix compared to any other algorithm in that family.
arXiv Detail & Related papers (2022-11-07T07:49:29Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
- Memory-Based Optimization Methods for Model-Agnostic Meta-Learning and Personalized Federated Learning [56.17603785248675]
Model-agnostic meta-learning (MAML) has become a popular research area.
Existing MAML algorithms rely on the 'episode' idea by sampling a few tasks and data points to update the meta-model at each iteration.
This paper proposes memory-based algorithms for MAML that converge with vanishing error.
arXiv Detail & Related papers (2021-06-09T08:47:58Z)
- Probabilistic Case-based Reasoning for Open-World Knowledge Graph Completion [59.549664231655726]
A case-based reasoning (CBR) system solves a new problem by retrieving 'cases' that are similar to the given problem.
In this paper, we demonstrate that such a system is achievable for reasoning in knowledge bases (KBs).
Our approach predicts attributes for an entity by gathering reasoning paths from similar entities in the KB.
arXiv Detail & Related papers (2020-10-07T17:48:12Z)
- Real-Time Regression with Dividing Local Gaussian Processes [62.01822866877782]
Local Gaussian processes are a novel, computationally efficient modeling approach based on Gaussian process regression.
Due to an iterative, data-driven division of the input space, they achieve a sublinear computational complexity in the total number of training points in practice.
A numerical evaluation on real-world data sets shows their advantages over other state-of-the-art methods in terms of accuracy as well as prediction and update speed.
arXiv Detail & Related papers (2020-06-16T18:43:31Z)
- Monte Carlo simulation studies on Python using the sstudy package with SQL databases as storage [0.0]
sstudy is a Python package designed to simplify the preparation of simulation studies.
We present a short statistical description of the simulation study procedure with a simplified explanation of what is being estimated.
arXiv Detail & Related papers (2020-04-27T20:49:43Z)
- Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning Approach [22.958342743597044]
We investigate the possibilities of utilizing deep learning for cardinality estimation of similarity selection.
We propose a novel and generic method that can be applied to any data type and distance function.
arXiv Detail & Related papers (2020-02-15T20:22:51Z)
- Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in the IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.