Data Pipeline Training: Integrating AutoML to Optimize the Data Flow of
Machine Learning Models
- URL: http://arxiv.org/abs/2402.12916v1
- Date: Tue, 20 Feb 2024 11:06:42 GMT
- Title: Data Pipeline Training: Integrating AutoML to Optimize the Data Flow of
Machine Learning Models
- Authors: Jiang Wu, Hongbo Wang, Chunhe Ni, Chenwei Zhang, Wenran Lu
- Abstract summary: The Data Pipeline plays an indispensable role in tasks such as machine learning modeling and the development of data products.
This paper focuses on exploring how to optimize data flow through automated machine learning methods.
We will discuss how to leverage AutoML technology to enhance the intelligence of Data Pipeline.
- Score: 17.091169031023714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Data Pipeline plays an indispensable role in tasks such as machine
learning modeling and the development of data products. With the increasing
diversification and complexity of data sources, as well as the rapid growth of
data volumes, building an efficient Data Pipeline has become crucial for
improving work efficiency and solving complex problems. This paper focuses on
exploring how to optimize data flow through automated machine learning methods
by integrating AutoML with the Data Pipeline. We discuss how to leverage AutoML
technology to enhance the intelligence of the Data Pipeline, thereby achieving
better results in machine learning tasks. By delving into the automation and
optimization of data flows, we uncover key strategies for constructing
efficient data pipelines that can adapt to the ever-changing data landscape.
This not only accelerates the modeling process but also provides innovative
solutions to complex problems, enabling more significant outcomes in
increasingly intricate data domains.
Keywords: Data Pipeline Training; AutoML; Data environment; Machine learning
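As a concrete illustration of what "integrating AutoML with the Data Pipeline" can look like, the sketch below is a minimal, hypothetical example rather than the authors' system: scikit-learn's Pipeline and GridSearchCV stand in for a full AutoML engine, jointly searching over preprocessing choices and model hyperparameters so that the data flow itself is part of the optimization. The dataset and search space are illustrative assumptions.

```python
# Minimal sketch: AutoML-style optimization of a data pipeline.
# scikit-learn's Pipeline + GridSearchCV stand in for a full AutoML engine,
# searching over preprocessing steps as well as model hyperparameters.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # toy dataset for illustration

# Data flow: impute -> scale -> select features -> classify.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=5000)),
])

# The search space covers data-flow steps, not only the model, so
# "optimizing the pipeline" includes tuning the preprocessing itself.
param_grid = {
    "scale": [StandardScaler(), MinMaxScaler()],
    "select__k": [5, 10, 20],
    "model__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

Dedicated AutoML systems extend the same idea from a fixed grid to automated search over entire pipeline structures.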
Related papers
- AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML [56.565200973244146]
Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline.
Recent works have started exploiting large language models (LLMs) to lessen this burden.
This paper proposes AutoML-Agent, a novel multi-agent framework tailored for full-pipeline AutoML.
arXiv Detail & Related papers (2024-10-03T20:01:09Z)
- Automated data processing and feature engineering for deep learning and big data applications: a survey [0.0]
The modern approach to artificial intelligence (AI) aims to design algorithms that learn directly from data.
However, not all data processing tasks in conventional deep learning pipelines have been automated.
arXiv Detail & Related papers (2024-03-18T01:07:48Z)
- Unlearn What You Want to Forget: Efficient Unlearning for LLMs [92.51670143929056]
Large language models (LLMs) have achieved significant progress from pre-training on and memorizing a wide range of textual data.
This process might suffer from privacy issues and violations of data protection regulations.
We propose an unlearning framework that can efficiently update LLMs without retraining the whole model after data removals.
arXiv Detail & Related papers (2023-10-31T03:35:59Z)
- AutoCure: Automated Tabular Data Curation Technique for ML Pipelines [0.0]
We present AutoCure, a novel and configuration-free data curation pipeline.
Unlike traditional data curation methods, AutoCure synthetically enhances the density of the clean data fraction.
In practice, AutoCure can be integrated with open source tools to promote the democratization of machine learning.
arXiv Detail & Related papers (2023-04-26T15:51:47Z)
- AutoEn: An AutoML method based on ensembles of predefined Machine Learning pipelines for supervised Traffic Forecasting [1.6242924916178283]
Traffic Forecasting (TF) is gaining relevance due to its ability to mitigate traffic congestion by forecasting future traffic states.
TF poses one major challenge to the machine learning paradigm, known as the Model Selection Problem (MSP).
We introduce AutoEn, a simple and efficient method for automatically generating multi-classifier ensembles from a predefined set of ML pipelines.
arXiv Detail & Related papers (2023-03-19T18:37:18Z)
- Privacy-Preserving Machine Learning for Collaborative Data Sharing via Auto-encoder Latent Space Embeddings [57.45332961252628]
Privacy-preserving machine learning in data-sharing processes is an ever-critical task.
This paper presents an innovative framework that uses Representation Learning via autoencoders to generate privacy-preserving embedded data.
arXiv Detail & Related papers (2022-11-10T17:36:58Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the procedures and pipelines needed for the actual deployment of machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script-language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Automated Machine Learning Techniques for Data Streams [91.3755431537592]
This paper surveys the state-of-the-art open-source AutoML tools, applies them to data collected from streams, and measures how their performance changes over time.
The results show that off-the-shelf AutoML tools can provide satisfactory results, but in the presence of concept drift, drift detection or adaptation techniques have to be applied to maintain predictive accuracy over time.
arXiv Detail & Related papers (2021-06-14T11:42:46Z)
- tf.data: A Machine Learning Data Processing Framework [0.4588028371034406]
Training machine learning models requires feeding input data for models to ingest.
We present tf.data, a framework for building and executing efficient input pipelines for machine learning jobs.
We demonstrate that input pipeline performance is critical to the end-to-end training time of state-of-the-art machine learning models.
arXiv Detail & Related papers (2021-01-28T17:16:46Z)
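For readers unfamiliar with tf.data, the sketch below shows a minimal input pipeline of the kind the paper describes. The in-memory toy data and the preprocessing function are illustrative assumptions, not taken from the paper itself.

```python
# Minimal tf.data input pipeline sketch (illustrative, not from the paper).
import tensorflow as tf

# Toy in-memory tensors standing in for records read from storage.
images = tf.random.uniform([1000, 32, 32, 3])
labels = tf.random.uniform([1000], maxval=10, dtype=tf.int32)

def preprocess(image, label):
    # Per-example transformation applied in parallel by the pipeline.
    image = tf.image.random_flip_left_right(image)
    return image, label

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .shuffle(buffer_size=1000)
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)  # overlap preprocessing with training
)

for batch_images, batch_labels in dataset.take(1):
    print(batch_images.shape, batch_labels.shape)
```

The parallel map and prefetch stages are where throughput tuning happens, which is why input-pipeline performance can dominate end-to-end training time.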
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.