DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning
over Tabular Data
- URL: http://arxiv.org/abs/2308.10915v1
- Date: Sun, 20 Aug 2023 23:40:26 GMT
- Title: DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning
over Tabular Data
- Authors: Peng Li, Zhiyi Chen, Xu Chu, Kexin Rong
- Abstract summary: We propose DiffPrep to automatically and efficiently search for a data preprocessing pipeline for a given dataset.
Our experiments show that DiffPrep achieves the best test accuracy on 15 out of the 18 real-world datasets evaluated.
- Score: 12.416345241511781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data preprocessing is a crucial step in the machine learning process that
transforms raw data into a more usable format for downstream ML models.
However, it can be costly and time-consuming, often requiring the expertise of
domain experts. Existing automated machine learning (AutoML) frameworks claim
to automate data preprocessing. However, they often use a restricted search
space of data preprocessing pipelines which limits the potential performance
gains, and they are often too slow as they require training the ML model
multiple times. In this paper, we propose DiffPrep, a method that can
automatically and efficiently search for a data preprocessing pipeline for a
given tabular dataset and a differentiable ML model such that the performance
of the ML model is maximized. We formalize the problem of data preprocessing
pipeline search as a bi-level optimization problem. To solve this problem
efficiently, we transform and relax the discrete, non-differentiable search
space into a continuous and differentiable one, which allows us to perform the
pipeline search using gradient descent while training the ML model only once.
Our experiments show that DiffPrep achieves the best test accuracy on 15 out of
the 18 real-world datasets evaluated and improves the model's test accuracy by
up to 6.6 percentage points.
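The bi-level formulation can be written out explicitly. With preprocessing pipeline parameters $\theta$ and model weights $w$ (notation mine, in the style of DARTS-like differentiable search):

$$\min_{\theta} \; \mathcal{L}_{\mathrm{val}}\big(w^{*}(\theta), \theta\big) \quad \text{s.t.} \quad w^{*}(\theta) = \arg\min_{w} \; \mathcal{L}_{\mathrm{train}}(w, \theta)$$

A minimal PyTorch-style sketch of the continuous relaxation is below. The operator set, two-step pipeline, linear downstream model, and first-order alternating updates are all illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Candidate preprocessing operators (all differentiable in x); assumed set.
OPS = [
    lambda x: x,                                    # identity
    lambda x: (x - x.mean(0)) / (x.std(0) + 1e-8),  # standardize
    lambda x: torch.log1p(x.clamp(min=0.0)),        # log transform
]

class MixedPrep(nn.Module):
    """One pipeline step: a softmax-weighted mixture over all operators,
    which relaxes the discrete choice of a single operator."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(len(OPS)))  # pipeline params

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, OPS))

def search(x_tr, y_tr, x_val, y_val, d_in, n_cls, steps=200):
    prep = nn.Sequential(MixedPrep(), MixedPrep())  # assumed 2-step pipeline
    model = nn.Linear(d_in, n_cls)                  # downstream ML model
    opt_w = torch.optim.Adam(model.parameters(), lr=1e-2)
    opt_a = torch.optim.Adam([m.alpha for m in prep], lr=1e-2)
    for _ in range(steps):
        # Inner problem: update model weights on the training loss.
        opt_w.zero_grad()
        F.cross_entropy(model(prep(x_tr)), y_tr).backward()
        opt_w.step()
        # Outer problem: update pipeline parameters on the validation loss.
        opt_a.zero_grad()
        F.cross_entropy(model(prep(x_val)), y_val).backward()
        opt_a.step()
    # Discretize: keep the highest-weight operator at each pipeline step.
    return [int(m.alpha.argmax()) for m in prep], model
```

Because both the mixture weights and the model are updated by gradient descent in one loop, the model is trained only once per search, which is where the claimed efficiency over train-and-evaluate AutoML search comes from.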
Related papers
- MUSO: Achieving Exact Machine Unlearning in Over-Parameterized Regimes [19.664090734076712]
Machine unlearning (MU) makes a well-trained model behave as if it had never been trained on specific data.
We propose an alternating optimization algorithm that unifies the tasks of unlearning and relabeling.
Numerical experiments confirm the algorithm's effectiveness, showing superior unlearning performance across various scenarios.
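The summary does not specify the updates, but the alternating structure it describes can be sketched generically. Below is a hypothetical relabel-then-retrain loop in PyTorch; the surrogate-label choice and loss are my assumptions, not MUSO's algorithm:

```python
import torch
import torch.nn.functional as F

def alternating_unlearn(model, opt, x_retain, y_retain, x_forget, rounds=10):
    for _ in range(rounds):
        # Relabeling step: give the forget set surrogate soft labels from
        # the current model, detached so they act as fixed targets.
        with torch.no_grad():
            y_surrogate = model(x_forget).softmax(dim=-1)
        # Unlearning step: fit the retain data normally while pulling the
        # forget set toward its surrogate labels, severing the link to the
        # original forget-set labels.
        opt.zero_grad()
        loss = F.cross_entropy(model(x_retain), y_retain) \
             + F.kl_div(model(x_forget).log_softmax(dim=-1),
                        y_surrogate, reduction="batchmean")
        loss.backward()
        opt.step()
    return model
```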
arXiv Detail & Related papers (2024-10-11T06:17:17Z)
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
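One plausible black-box probe in this spirit (an assumption about the mechanism, not necessarily the paper's exact procedure) is to compare the model's score for a benchmark item against variants with permuted answer options; a memorized item prefers the canonical ordering anomalously strongly. Here `logprob` is a hypothetical callable returning the model's log-probability of a prompt:

```python
import itertools
import random
import statistics

def leakage_score(question, options, logprob, n_perms=8):
    """Gap between the canonical option order and random permutations;
    a large positive gap suggests the item was seen in training."""
    def render(opts):
        labels = "ABCDEFGH"[: len(opts)]
        body = "\n".join(f"{l}. {o}" for l, o in zip(labels, opts))
        return f"{question}\n{body}"

    original = logprob(render(options))
    perms = [list(p) for p in itertools.permutations(options)
             if list(p) != list(options)]
    sample = random.sample(perms, min(n_perms, len(perms)))
    return original - statistics.mean(logprob(render(p)) for p in sample)
```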
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- Accelerated Cloud for Artificial Intelligence (ACAI) [24.40451195277244]
We propose an end-to-end cloud-based machine learning platform, Accelerated Cloud for AI (ACAI).
ACAI enables cloud-based storage of indexed, labeled, and searchable data, as well as automatic resource provisioning, job scheduling, and experiment tracking.
We show that our auto-provisioner produces a 1.7x speed-up and 39% cost reduction, and our system reduces experiment time for ML scientists by 20% on typical ML use cases.
arXiv Detail & Related papers (2024-01-30T07:09:48Z)
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- AutoSlicer: Scalable Automated Data Slicing for ML Model Analysis [3.3446830960153555]
We present AutoSlicer, a scalable system that searches for problematic slices through distributed metric computation and hypothesis testing.
In the experiments, we show that our search strategy finds most of the anomalous slices by inspecting a small portion of the search space.
arXiv Detail & Related papers (2022-12-18T07:49:17Z)
- Data Debugging with Shapley Importance over End-to-End Machine Learning Pipelines [27.461398584509755]
DataScope is the first system that efficiently computes Shapley values of training examples over an end-to-end machine learning pipeline.
Our results show that DataScope is up to four orders of magnitude faster than state-of-the-art Monte Carlo-based methods.
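For scale, the Monte Carlo baseline DataScope is compared against is the standard permutation estimator of training-example Shapley values, which retrains the pipeline once per prefix. A sketch, where `fit_and_score(idx)` is a hypothetical helper that trains on the given subset and returns validation accuracy:

```python
import random

def monte_carlo_shapley(n_train, fit_and_score, n_perms=100):
    """Permutation-sampling estimate of each training example's Shapley
    value; O(n_perms * n_train) pipeline retrainings, hence the cost."""
    phi = [0.0] * n_train
    for _ in range(n_perms):
        order = random.sample(range(n_train), n_train)
        prev = fit_and_score([])           # utility of the empty set
        subset = []
        for i in order:
            subset.append(i)
            cur = fit_and_score(subset)    # utility after adding example i
            phi[i] += (cur - prev) / n_perms
            prev = cur
    return phi
```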
arXiv Detail & Related papers (2022-04-23T19:29:23Z)
- Complementary Ensemble Learning [1.90365714903665]
We derive a technique to improve the performance of state-of-the-art deep learning models.
Specifically, we train auxiliary models that complement the uncertainty of a state-of-the-art model.
arXiv Detail & Related papers (2021-11-09T03:23:05Z)
- AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective.
We demonstrate that the proposed method finds the optimal data distribution faster (up to $50\times$), with significantly reduced training data generation (up to $30\times$) and better accuracy ($+8.7\%$) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z)
- Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.