VeML: An End-to-End Machine Learning Lifecycle for Large-scale and
High-dimensional Data
- URL: http://arxiv.org/abs/2304.13037v2
- Date: Thu, 27 Jul 2023 06:09:18 GMT
- Title: VeML: An End-to-End Machine Learning Lifecycle for Large-scale and
High-dimensional Data
- Authors: Van-Duc Le, Cuong-Tien Bui, Wen-Syan Li
- Abstract summary: This paper introduces VeML, a Version management system dedicated to the end-to-end machine learning lifecycle.
We address the high cost of building an ML lifecycle, especially for large-scale, high-dimensional datasets.
We design a coreset-based algorithm to compute similarity for large-scale, high-dimensional data efficiently.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: An end-to-end machine learning (ML) lifecycle consists of many
iterative processes, from data preparation and ML model design to model
training and then deploying the trained model for inference. When building an
end-to-end lifecycle for an ML problem, many ML pipelines must be designed and
executed, producing a huge number of lifecycle versions. This paper therefore
introduces VeML, a Version management system dedicated to the end-to-end ML
lifecycle. Our system tackles several crucial problems that other systems have
not solved. First, we address the high cost of building an ML lifecycle,
especially for large-scale and high-dimensional datasets. We solve this
problem by transferring the lifecycle of similar datasets managed in our
system to the new training data, using a coreset-based algorithm we design to
compute similarity for large-scale, high-dimensional data efficiently.
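
The abstract does not spell out the similarity algorithm, so the following is
only a minimal sketch of how a coreset-based similarity check could work,
assuming a k-center-greedy coreset construction and a symmetric Chamfer
distance between coresets; the function names, the coreset method, and the
distance are all illustrative choices, not VeML's published algorithm:

```python
import numpy as np

def kcenter_greedy_coreset(X: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Pick k points that cover X well (k-center greedy).

    X is an (n, d) array of feature vectors; returns a (k, d) coreset.
    """
    rng = np.random.default_rng(seed)
    k = min(k, len(X))
    centers = [int(rng.integers(len(X)))]
    # Distance from every point to its nearest chosen center so far.
    d = np.linalg.norm(X - X[centers[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))  # the farthest point joins the coreset
        centers.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return X[centers]

def coreset_similarity(A: np.ndarray, B: np.ndarray, k: int = 256) -> float:
    """Symmetric Chamfer distance between coresets of two datasets.

    Lower is more similar; comparing k-point coresets instead of the
    full datasets keeps each comparison at O(k^2) regardless of n.
    """
    ca, cb = kcenter_greedy_coreset(A, k), kcenter_greedy_coreset(B, k)
    dists = np.linalg.norm(ca[:, None, :] - cb[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean() + dists.min(axis=0).mean())
```

In a system like VeML, such a score would be computed between a new training
dataset and each dataset already under management, and the lifecycle of the
closest match would be transferred as a starting point.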
Another critical issue is model accuracy degradation caused by differences
between the training data and the testing data that arise during the ML
lifetime, which forces the lifecycle to be rebuilt. Our system detects this
mismatch without requiring labels for the testing data and rebuilds the ML
lifecycle for the new data version. To demonstrate our contributions, we
conduct experiments on real-world, large-scale datasets of driving images and
spatiotemporal sensor data and show promising results.
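
The abstract does not say how the training/testing mismatch is detected
without labels. A standard label-free approach is a two-sample test on feature
representations, such as the maximum mean discrepancy (MMD) with a permutation
test; the sketch below illustrates that generic idea only, and the RBF kernel,
median-bandwidth heuristic, and significance threshold are assumptions rather
than VeML's method:

```python
import numpy as np

def rbf_mmd2(X: np.ndarray, Y: np.ndarray) -> float:
    """Squared maximum mean discrepancy (biased estimator, RBF kernel).

    X holds training-data features, Y holds incoming unlabeled features;
    the kernel bandwidth uses the median pairwise-distance heuristic.
    """
    Z = np.vstack([X, Y])
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    sigma2 = np.median(sq[sq > 0])  # median heuristic for the bandwidth
    K = np.exp(-sq / (2 * sigma2))
    n = len(X)
    return float(K[:n, :n].mean() + K[n:, n:].mean() - 2 * K[:n, n:].mean())

def drift_detected(train_feats, test_feats, n_perm=200, alpha=0.05, seed=0):
    """Permutation test: is the observed MMD larger than chance would allow?"""
    rng = np.random.default_rng(seed)
    observed = rbf_mmd2(train_feats, test_feats)
    Z, n = np.vstack([train_feats, test_feats]), len(train_feats)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(len(Z))  # shuffle away any real difference
        null[i] = rbf_mmd2(Z[perm[:n]], Z[perm[n:]])
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return p_value < alpha  # True => the distributions likely differ
```

A True return value would be the signal to rebuild the lifecycle for the new
data version; the features passed in could be raw inputs or embeddings
extracted by the currently deployed model.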
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through
  Failure-Inducing Exploration [90.41908331897639] (2024-10-22)
  Large language models (LLMs) have significantly benefited from training on
  diverse, high-quality task-specific data. We present ReverseGen, a novel
  approach designed to automatically generate effective training samples.
- Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
  [85.4310806466002] (2024-07-09)
  We find that increasing the size of the datastore used by a retrieval-based
  LM monotonically improves language modeling and several downstream tasks
  without obvious saturation. By plotting compute-optimal scaling curves with
  varied datastore, model, and pretraining data sizes, we show that larger
  datastores can significantly improve model performance for the same training
  compute budget.
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM
  [62.30753425449056] (2023-08-25)
  We propose a novel closed-loop system that bridges data generation, model
  training, and evaluation. Within each loop, the MLLM-DataEngine first
  analyzes the weaknesses of the model based on the evaluation results. For
  targeting, we propose an Adaptive Bad-case Sampling module, which adjusts
  the ratio of different types of data. For quality, we resort to GPT-4 to
  generate high-quality data for each given data type.
- In Situ Framework for Coupling Simulation and Machine Learning with
  Application to CFD [51.04126395480625] (2023-06-22)
  Recent years have seen many successful applications of machine learning (ML)
  to facilitate fluid dynamic computations. As simulations grow, generating
  new training datasets for traditional offline learning creates I/O and
  storage bottlenecks. This work offers a solution by simplifying the coupling
  and enabling in situ training and inference on heterogeneous clusters.
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
  [50.31589712761807] (2023-05-22)
  Large language models (LLMs) are notoriously token-hungry during
  pre-training, and high-quality text data on the web is approaching its
  scaling limit for LLMs. We investigate the consequences of repeating
  pre-training data, revealing that the model is susceptible to overfitting.
  We also examine the key factors contributing to multi-epoch degradation,
  finding that dataset size, model parameters, and training objectives all
  play significant roles.
- SimbaML: Connecting Mechanistic Models and Machine Learning with Augmented
  Data [0.0] (2023-04-08)
  SimbaML is an open-source tool that unifies realistic synthetic dataset
  generation from ordinary differential equation-based models. It conveniently
  enables investigating transfer learning from synthetic to real-world data.
- REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in ML
  Pipelines [0.0] (2023-02-09)
  We introduce a benchmark, called REIN, to thoroughly investigate the impact
  of data cleaning methods on various machine learning models. Through the
  benchmark, we provide answers to important research questions, e.g., where
  and whether data cleaning is a necessary step in ML pipelines.
- Designing Data: Proactive Data Collection and Iteration for Machine Learning
  [12.295169687537395] (2023-01-24)
  Lack of diversity in data collection has caused significant failures in
  machine learning (ML) applications. New methods to track and manage data
  collection, iteration, and model training are necessary for evaluating
  whether datasets reflect real-world variability.
- Data Debugging with Shapley Importance over End-to-End Machine Learning
  Pipelines [27.461398584509755] (2022-04-23)
  DataScope is the first system that efficiently computes Shapley values of
  training examples over an end-to-end machine learning pipeline (a generic
  Shapley sketch follows after this list). Our results show that DataScope is
  up to four orders of magnitude faster than state-of-the-art Monte
  Carlo-based methods.
- Improving Classifier Training Efficiency for Automatic Cyberbullying
  Detection with Feature Density [58.64907136562178] (2021-11-02)
  We study the effectiveness of Feature Density (FD) using different
  linguistically-backed feature preprocessing methods. We hypothesise that
  estimating dataset complexity allows for the reduction of the number of
  required experiments. The difference in linguistic complexity of datasets
  also allows us to discuss the efficacy of linguistically-backed word
  preprocessing.
This list is automatically generated from the titles and abstracts of the
papers on this site. The site does not guarantee the quality of this
information and is not responsible for any consequences of its use.