VeML: An End-to-End Machine Learning Lifecycle for Large-scale and
High-dimensional Data
- URL: http://arxiv.org/abs/2304.13037v2
- Date: Thu, 27 Jul 2023 06:09:18 GMT
- Title: VeML: An End-to-End Machine Learning Lifecycle for Large-scale and
High-dimensional Data
- Authors: Van-Duc Le, Cuong-Tien Bui, Wen-Syan Li
- Abstract summary: This paper introduces VeML, a Version management system dedicated to the end-to-end machine learning lifecycle.
We address the high cost of building an ML lifecycle, especially for large-scale, high-dimensional datasets.
We design a coreset-based algorithm to compute similarity for large-scale, high-dimensional data efficiently.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: An end-to-end machine learning (ML) lifecycle consists of many
iterative processes, from data preparation and ML model design to model
training and then deploying the trained model for inference. When building an
end-to-end lifecycle for an ML problem, many ML pipelines must be designed and
executed, producing a huge number of lifecycle versions. This paper therefore
introduces VeML, a Version management system dedicated to the end-to-end ML
lifecycle. Our system tackles several crucial problems that other systems have
not solved. First, we address the high cost of building an ML lifecycle,
especially for large-scale and high-dimensional datasets. We solve this
problem by transferring the lifecycle of similar datasets managed in our
system to the new training data, using a coreset-based algorithm we design to
compute similarity for large-scale, high-dimensional data efficiently.
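
The abstract does not spell out the similarity algorithm, so the following is
only a minimal sketch of how a coreset-based similarity check could work,
assuming a k-center-greedy coreset construction and a symmetric Chamfer
distance between coresets; the function names, the coreset method, and the
distance are all illustrative choices, not VeML's published algorithm:

```python
import numpy as np

def kcenter_greedy_coreset(X: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Pick k points that cover X well (k-center greedy).

    X is an (n, d) array of feature vectors; returns a (k, d) coreset.
    """
    rng = np.random.default_rng(seed)
    k = min(k, len(X))
    centers = [int(rng.integers(len(X)))]
    # Distance from every point to its nearest chosen center so far.
    d = np.linalg.norm(X - X[centers[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))  # the farthest point joins the coreset
        centers.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return X[centers]

def coreset_similarity(A: np.ndarray, B: np.ndarray, k: int = 256) -> float:
    """Symmetric Chamfer distance between coresets of two datasets.

    Lower is more similar; comparing k-point coresets instead of the
    full datasets keeps each comparison at O(k^2) regardless of n.
    """
    ca, cb = kcenter_greedy_coreset(A, k), kcenter_greedy_coreset(B, k)
    dists = np.linalg.norm(ca[:, None, :] - cb[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean() + dists.min(axis=0).mean())
```

In a system like VeML, such a score would be computed between a new training
dataset and each dataset already under management, and the lifecycle of the
closest match would be transferred as a starting point.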
Another critical issue is model accuracy degradation caused by differences
between the training data and the testing data that arise during the ML
lifetime, which forces the lifecycle to be rebuilt. Our system detects this
mismatch without requiring labels for the testing data and rebuilds the ML
lifecycle for the new data version. To demonstrate our contributions, we
conduct experiments on real-world, large-scale datasets of driving images and
spatiotemporal sensor data and show promising results.
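
The abstract does not say how the training/testing mismatch is detected
without labels. A standard label-free approach is a two-sample test on feature
representations, such as the maximum mean discrepancy (MMD) with a permutation
test; the sketch below illustrates that generic idea only, and the RBF kernel,
median-bandwidth heuristic, and significance threshold are assumptions rather
than VeML's method:

```python
import numpy as np

def rbf_mmd2(X: np.ndarray, Y: np.ndarray) -> float:
    """Squared maximum mean discrepancy (biased estimator, RBF kernel).

    X holds training-data features, Y holds incoming unlabeled features;
    the kernel bandwidth uses the median pairwise-distance heuristic.
    """
    Z = np.vstack([X, Y])
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    sigma2 = np.median(sq[sq > 0])  # median heuristic for the bandwidth
    K = np.exp(-sq / (2 * sigma2))
    n = len(X)
    return float(K[:n, :n].mean() + K[n:, n:].mean() - 2 * K[:n, n:].mean())

def drift_detected(train_feats, test_feats, n_perm=200, alpha=0.05, seed=0):
    """Permutation test: is the observed MMD larger than chance would allow?"""
    rng = np.random.default_rng(seed)
    observed = rbf_mmd2(train_feats, test_feats)
    Z, n = np.vstack([train_feats, test_feats]), len(train_feats)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(len(Z))  # shuffle away any real difference
        null[i] = rbf_mmd2(Z[perm[:n]], Z[perm[n:]])
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return p_value < alpha  # True => the distributions likely differ
```

A True return value would be the signal to rebuild the lifecycle for the new
data version; the features passed in could be raw inputs or embeddings
extracted by the currently deployed model.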
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through
  Failure-Inducing Exploration [90.41908331897639] (2024-10-22)
  Large language models (LLMs) have significantly benefited from training on
  diverse, high-quality task-specific data. We present ReverseGen, a novel
  approach designed to automatically generate effective training samples.
- Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
  [85.4310806466002] (2024-07-09)
  We find that increasing the size of the datastore used by a retrieval-based
  LM monotonically improves language modeling and several downstream tasks
  without obvious saturation. By plotting compute-optimal scaling curves with
  varied datastore, model, and pretraining data sizes, we show that larger
  datastores can significantly improve model performance for the same training
  compute budget.
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM
  [62.30753425449056] (2023-08-25)
  We propose a novel closed-loop system that bridges data generation, model
  training, and evaluation. Within each loop, the MLLM-DataEngine first
  analyzes the weaknesses of the model based on the evaluation results. For
  targeting, we propose an Adaptive Bad-case Sampling module, which adjusts
  the ratio of different types of data. For quality, we resort to GPT-4 to
  generate high-quality data for each given data type.
- In Situ Framework for Coupling Simulation and Machine Learning with
  Application to CFD [51.04126395480625] (2023-06-22)
  Recent years have seen many successful applications of machine learning (ML)
  to facilitate fluid dynamic computations. As simulations grow, generating
  new training datasets for traditional offline learning creates I/O and
  storage bottlenecks. This work offers a solution by simplifying the coupling
  and enabling in situ training and inference on heterogeneous clusters.
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
  [50.31589712761807] (2023-05-22)
  Large language models (LLMs) are notoriously token-hungry during
  pre-training, and high-quality text data on the web is approaching its
  scaling limit for LLMs. We investigate the consequences of repeating
  pre-training data, revealing that the model is susceptible to overfitting.
  We also examine the key factors contributing to multi-epoch degradation,
  finding that dataset size, model parameters, and training objectives all
  play significant roles.
- SimbaML: Connecting Mechanistic Models and Machine Learning with Augmented
  Data [0.0] (2023-04-08)
  SimbaML is an open-source tool that unifies realistic synthetic dataset
  generation from ordinary differential equation-based models. It conveniently
  enables investigating transfer learning from synthetic to real-world data.
- REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in ML
  Pipelines [0.0] (2023-02-09)
  We introduce a benchmark, called REIN, to thoroughly investigate the impact
  of data cleaning methods on various machine learning models. Through the
  benchmark, we provide answers to important research questions, e.g., where
  and whether data cleaning is a necessary step in ML pipelines.
- Designing Data: Proactive Data Collection and Iteration for Machine Learning
  [12.295169687537395] (2023-01-24)
  Lack of diversity in data collection has caused significant failures in
  machine learning (ML) applications. New methods to track and manage data
  collection, iteration, and model training are necessary for evaluating
  whether datasets reflect real-world variability.
- Data Debugging with Shapley Importance over End-to-End Machine Learning
  Pipelines [27.461398584509755] (2022-04-23)
  DataScope is the first system that efficiently computes Shapley values of
  training examples over an end-to-end machine learning pipeline (a generic
  Shapley sketch follows after this list). Our results show that DataScope is
  up to four orders of magnitude faster than state-of-the-art Monte
  Carlo-based methods.
- Improving Classifier Training Efficiency for Automatic Cyberbullying
  Detection with Feature Density [58.64907136562178] (2021-11-02)
  We study the effectiveness of Feature Density (FD) using different
  linguistically-backed feature preprocessing methods. We hypothesise that
  estimating dataset complexity allows for the reduction of the number of
  required experiments. The difference in linguistic complexity of datasets
  also allows us to discuss the efficacy of linguistically-backed word
  preprocessing.
This list is automatically generated from the titles and abstracts of the
papers on this site. The site does not guarantee the quality of this
information and is not responsible for any consequences of its use.