Multi-layer Optimizations for End-to-End Data Analytics
- URL: http://arxiv.org/abs/2001.03541v1
- Date: Fri, 10 Jan 2020 16:14:44 GMT
- Title: Multi-layer Optimizations for End-to-End Data Analytics
- Authors: Amir Shaikhha, Maximilian Schleich, Alexandru Ghita, Dan Olteanu
- Abstract summary: We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and TensorFlow by several orders of magnitude for linear regression and regression tree models over several relational datasets.
- Score: 71.05611866288196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of training machine learning models over
multi-relational data. The mainstream approach is to first construct the
training dataset using a feature extraction query over the input database and
then use a statistical software package of choice to train the model. In this
paper, we introduce Iterative Functional Aggregate Queries (IFAQ), a framework
that realizes an alternative approach. IFAQ treats the feature extraction query
and the learning task as one program given in IFAQ's domain-specific language,
which captures a subset of Python commonly used in Jupyter notebooks for rapid
prototyping of machine learning applications. The program is subject to several
layers of IFAQ optimizations, such as algebraic transformations, loop
transformations, schema specialization, data layout optimizations, and finally
compilation into efficient low-level C++ code specialized for the given
workload and data.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit,
and TensorFlow by several orders of magnitude for linear regression and
regression tree models over several relational datasets.
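The core idea of fusing the feature extraction query with the learning task can be illustrated by aggregate pushdown: statistics needed for training (here, a single sum of products over a join) can be computed without ever materializing the joined training dataset. This is a minimal sketch of that idea only; the relation names and toy data below are our own, not from the paper, and the real IFAQ pipeline applies many more rewrites before emitting C++.

```python
# Sketch: compute a training statistic (sum over the join of a * b)
# two ways: by materializing the join, and by pushing the aggregate
# past the join. Relations R(key, a) and S(key, b) are illustrative.

R = [(1, 2.0), (2, 3.0), (3, 1.0)]
S = [(1, 4.0), (1, 5.0), (2, 6.0)]

# Mainstream approach: materialize the join, then aggregate.
joined = [(a, b) for (k1, a) in R for (k2, b) in S if k1 == k2]
stat_materialized = sum(a * b for a, b in joined)

# Fused approach: sum_{join} a*b = sum_{key} a[key] * (sum of b per key).
b_sums = {}
for k, b in S:
    b_sums[k] = b_sums.get(k, 0.0) + b
stat_pushed = sum(a * b_sums.get(k, 0.0) for k, a in R)

assert stat_materialized == stat_pushed
print(stat_pushed)
```

The pushed-down version touches each relation once instead of enumerating the (potentially much larger) join result, which is where the orders-of-magnitude speedups over materialize-then-train pipelines come from.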
Related papers
- Learning to Retrieve Iteratively for In-Context Learning [56.40100968649039]
Iterative retrieval is a novel framework that empowers retrievers to make iterative decisions through policy optimization.
We instantiate an iterative retriever for composing in-context learning exemplars and apply it to various semantic parsing tasks.
By adding only 4M additional parameters for state encoding, we convert an off-the-shelf dense retriever into a stateful iterative retriever.
arXiv Detail & Related papers (2024-06-20T21:07:55Z)
- FissionFusion: Fast Geometric Generation and Hierarchical Souping for Medical Image Analysis [0.7751705157998379]
The scarcity of well-annotated medical datasets requires leveraging transfer learning from broader datasets like ImageNet or pre-trained models like CLIP.
Model soups average the weights of multiple fine-tuned models, aiming to improve performance on In-Domain (ID) tasks and enhance robustness against Out-of-Distribution (OOD) datasets.
We propose a hierarchical merging approach that involves local and global aggregation of models at various levels.
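A uniform model soup is just a parameter-wise average of fine-tuned checkpoints. The sketch below illustrates that averaging step with toy one-dimensional weight vectors standing in for full checkpoints; the hierarchical local/global merging proposed in the paper is not reproduced here.

```python
# Minimal sketch of a uniform "model soup": average corresponding
# parameters across several fine-tuned models of the same architecture.

def soup(models):
    """Parameter-wise mean of equally shaped weight lists."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

# Toy weight vectors standing in for fine-tuned checkpoints.
finetuned = [
    [0.9, 1.2, -0.3],
    [1.1, 0.8, -0.1],
    [1.0, 1.0, -0.2],
]
print(soup(finetuned))  # parameter-wise mean of the three models
```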
arXiv Detail & Related papers (2024-03-20T06:48:48Z)
- Selecting Walk Schemes for Database Embedding [6.7609045625714925]
We study the embedding of components of a relational database.
We focus on the recent FoRWaRD algorithm that is designed for dynamic databases.
We show that by focusing on a few informative walk schemes, we can obtain embeddings significantly faster, while retaining the quality.
arXiv Detail & Related papers (2024-01-20T11:39:32Z)
- Optimal Data Generation in Multi-Dimensional Parameter Spaces, using Bayesian Optimization [0.0]
We propose a novel approach for constructing a minimal yet highly informative database for training machine learning models.
We mimic the underlying relation between the output and input parameters using Gaussian process regression (GPR).
Given the predicted standard deviation by GPR, we select data points using Bayesian optimization to obtain an efficient database for training ML models.
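The selection step described above (query the GPR's predictive standard deviation, then add the most uncertain candidate to the database) can be sketched in pure NumPy. The RBF kernel, its length scale, and the toy one-dimensional candidate grid below are our own illustrative choices, not the paper's setup.

```python
import numpy as np

# Sketch: pick the next training point where a Gaussian process is most
# uncertain (pure-uncertainty acquisition), using an RBF-kernel GP.

def rbf(A, B, length=0.5):
    """RBF (squared-exponential) kernel matrix between row-vector sets."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / length**2)

def gp_posterior_std(X_train, X_cand, noise=1e-6):
    """Posterior standard deviation at each candidate input."""
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf(X_train, X_cand)
    Kss_diag = np.ones(len(X_cand))          # k(x, x) = 1 for RBF
    solve = np.linalg.solve(K, Ks)
    var = Kss_diag - np.sum(Ks * solve, axis=0)
    return np.sqrt(np.maximum(var, 0.0))

X_train = np.array([[0.0], [1.0]])           # points already in the database
X_cand = np.linspace(0, 1, 101)[:, None]     # candidate inputs
std = gp_posterior_std(X_train, X_cand)
next_point = X_cand[np.argmax(std)]          # most uncertain candidate
print(next_point)
```

With training points at the two ends of the interval, the most uncertain candidate falls near the middle, which is the behavior a pure-uncertainty acquisition should exhibit.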
arXiv Detail & Related papers (2023-12-04T16:36:29Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
- Optimally Weighted Ensembles of Regression Models: Exact Weight Optimization and Applications [0.0]
We show that combining different regression models can yield better results than selecting a single ('best') regression model.
We outline an efficient method that obtains an optimally weighted linear combination from a heterogeneous set of regression models.
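Exact weight optimization for a linear ensemble reduces to a least-squares problem: stack each model's predictions as a column of a matrix P and solve min_w ||P w - y||^2. The sketch below shows this reduction with toy predictions standing in for real models; it is an illustration of the general technique, not the paper's specific method (which may add constraints such as non-negative or sum-to-one weights).

```python
import numpy as np

# Sketch: exact ensemble-weight optimization by least squares.
# Toy "model predictions" stand in for a heterogeneous model set.

y = np.array([1.0, 2.0, 3.0, 4.0])               # held-out targets
preds_model_a = np.array([1.1, 1.9, 3.2, 3.8])   # a decent model
preds_model_b = np.array([0.5, 1.0, 1.5, 2.0])   # a biased model (y / 2)

# Each model contributes one column of P; solve min_w ||P w - y||^2.
P = np.column_stack([preds_model_a, preds_model_b])
w, *_ = np.linalg.lstsq(P, y, rcond=None)
ensemble = P @ w

err = lambda p: np.sum((p - y) ** 2)
print(w, err(ensemble))
```

Because the single-model choices w = (1, 0) and w = (0, 1) are feasible, the optimally weighted combination is never worse (in squared error on the fitting data) than the best individual model, which is why combining can beat selecting.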
arXiv Detail & Related papers (2022-06-22T09:11:14Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
- Parameter-Efficient Abstractive Question Answering over Tables or Text [60.86457030988444]
A long-term ambition of information seeking QA systems is to reason over multi-modal contexts and generate natural answers to user queries.
Memory intensive pre-trained language models are adapted to downstream tasks such as QA by fine-tuning the model on QA data in a specific modality like unstructured text or structured tables.
To avoid training such memory-hungry models while utilizing a uniform architecture for each modality, parameter-efficient adapters add and train small task-specific bottleneck layers between transformer layers.
arXiv Detail & Related papers (2022-04-07T10:56:29Z)
- Efficient Data-specific Model Search for Collaborative Filtering [56.60519991956558]
Collaborative filtering (CF) is a fundamental approach for recommender systems.
In this paper, motivated by the recent advances in automated machine learning (AutoML), we propose to design a data-specific CF model.
The key is a new framework that unifies state-of-the-art (SOTA) CF methods and splits them into disjoint stages of input encoding, embedding function, interaction, and prediction function.
arXiv Detail & Related papers (2021-06-14T14:30:32Z)
- StackGenVis: Alignment of Data, Algorithms, and Models for Stacking Ensemble Learning Using Performance Metrics [4.237343083490243]
In machine learning (ML), ensemble methods such as bagging, boosting, and stacking are widely established approaches.
StackGenVis is a visual analytics system for stacked generalization.
arXiv Detail & Related papers (2020-05-04T15:43:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.