MLCask: Efficient Management of Component Evolution in Collaborative
Data Analytics Pipelines
- URL: http://arxiv.org/abs/2010.10246v4
- Date: Tue, 16 Mar 2021 12:54:40 GMT
- Title: MLCask: Efficient Management of Component Evolution in Collaborative
Data Analytics Pipelines
- Authors: Zhaojing Luo, Sai Ho Yeung, Meihui Zhang, Kaiping Zheng, Lei Zhu, Gang
Chen, Feiyi Fan, Qian Lin, Kee Yuan Ngiam, Beng Chin Ooi
- Abstract summary: We identify two main challenges that arise during the deployment of machine learning pipelines and address them with the design of versioning for MLCask, an end-to-end analytics system.
We define and accelerate the metric-driven merge operation by pruning the pipeline search tree using reusable history records and pipeline compatibility information.
The effectiveness of MLCask is evaluated through an extensive study over several real-world deployment cases.
- Score: 29.999324319722508
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the ever-increasing adoption of machine learning for data analytics,
maintaining a machine learning pipeline is becoming more complex as both the
datasets and trained models evolve with time. In a collaborative environment,
the changes and updates due to pipeline evolution often cause cumbersome
coordination and maintenance work, raising the costs and making it hard to use.
Existing solutions, unfortunately, do not address the version evolution
problem, especially in a collaborative environment where non-linear version
control semantics are necessary to isolate operations made by different user
roles. The lack of version control semantics also incurs unnecessary storage
consumption and lowers efficiency due to data duplication and repeated data
pre-processing, which are avoidable. In this paper, we identify two main
challenges that arise during the deployment of machine learning pipelines, and
address them with the design of versioning for an end-to-end analytics system
MLCask. The system supports multiple user roles with the ability to perform
Git-like branching and merging operations in the context of the machine
learning pipelines. We define and accelerate the metric-driven merge operation
by pruning the pipeline search tree using reusable history records and pipeline
compatibility information. Further, we design and implement the prioritized
pipeline search, which gives preference to the pipelines that probably yield
better performance. The effectiveness of MLCask is evaluated through an
extensive study over several real-world deployment cases. The performance
evaluation shows that the proposed merge operation is up to 7.8x faster than,
and saves up to 11.9x storage space compared with, the baseline method that
does not utilize history records.
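The metric-driven merge described in the abstract can be illustrated with a minimal sketch. It is not MLCask's implementation: the function name `metric_driven_merge`, the component version names, the `compatible` predicate, and the toy scoring function are all hypothetical. The sketch also simplifies history reuse to a cache of final pipeline scores, and it omits the prioritized search order.

```python
from typing import Callable, Dict, List, Optional, Tuple

Pipeline = Tuple[str, ...]  # one component version per pipeline stage


def metric_driven_merge(
    candidates: List[List[str]],
    compatible: Callable[[str, str], bool],
    evaluate: Callable[[Pipeline], float],
    history: Dict[Pipeline, float],
) -> Tuple[Optional[Pipeline], float, int]:
    """Return (best pipeline, best score, number of fresh evaluations)."""
    best: Optional[Pipeline] = None
    best_score = float("-inf")
    fresh_runs = 0

    def search(prefix: Pipeline) -> None:
        nonlocal best, best_score, fresh_runs
        stage = len(prefix)
        if stage == len(candidates):
            if prefix in history:
                # Reuse a recorded result instead of re-running the pipeline.
                score = history[prefix]
            else:
                score = evaluate(prefix)
                history[prefix] = score
                fresh_runs += 1
            if score > best_score:
                best, best_score = prefix, score
            return
        for version in candidates[stage]:
            # Prune the whole subtree when adjacent components are incompatible.
            if prefix and not compatible(prefix[-1], version):
                continue
            search(prefix + (version,))

    search(())
    return best, best_score, fresh_runs


# Hypothetical component versions contributed by two branches.
candidates = [
    ["clean_v1", "clean_v2"],
    ["feat_v1", "feat_v2"],
    ["model_v1", "model_v2"],
]
# Assumed incompatibility: feat_v2 cannot consume the output of clean_v1.
incompatible = {("clean_v1", "feat_v2")}


def compatible(upstream: str, downstream: str) -> bool:
    return (upstream, downstream) not in incompatible


def evaluate(pipeline: Pipeline) -> float:
    # Toy metric standing in for a real training and validation run.
    return sum(int(name[-1]) for name in pipeline) / 10


# One pipeline was already evaluated before the merge.
history = {("clean_v1", "feat_v1", "model_v1"): 0.3}

best, best_score, fresh_runs = metric_driven_merge(
    candidates, compatible, evaluate, history
)
print(best, best_score, fresh_runs)
```

In this toy setup, the compatibility check removes two of the eight candidate pipelines and the history record saves one more run, so fewer pipelines are evaluated from scratch than in an exhaustive search.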
Related papers
- ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data.
We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z)
- Instrumentation and Analysis of Native ML Pipelines via Logical Query Plans [3.2362171533623054]
We envision highly-automated software platforms to assist data scientists with developing, validating, monitoring, and analysing their Machine Learning pipelines.
We extract "logical query plans" from ML pipeline code relying on popular libraries.
Based on these plans, we automatically infer pipeline semantics and instrument and rewrite the ML pipelines to enable diverse use cases without requiring data scientists to manually annotate or rewrite their code.
arXiv Detail & Related papers (2024-07-10T11:35:02Z)
- Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie [5.259526087073711]
We introduce a system designed to decouple compute from data management, by leveraging a cloud runtime alongside Nessie.
We demonstrate its ability to offer time-travel and branching semantics on top of object storage, and to offer full pipeline reproducibility with a few CLI commands.
arXiv Detail & Related papers (2024-04-21T14:53:33Z)
- List-aware Reranking-Truncation Joint Model for Search and
Retrieval-augmented Generation [80.12531449946655]
We propose a Reranking-Truncation joint model (GenRT) that can perform the two tasks concurrently.
GenRT integrates reranking and truncation via a generative paradigm based on an encoder-decoder architecture.
Our method achieves SOTA performance on both reranking and truncation tasks for web search and retrieval-augmented LLMs.
arXiv Detail & Related papers (2024-02-05T06:52:53Z)
- Trusted Provenance of Automated, Collaborative and Adaptive Data Processing Pipelines [2.186901738997927]
We provide a solution architecture and a proof of concept implementation of a service, called Provenance Holder.
Provenance Holder enables provenance of collaborative, adaptive data processing pipelines in a trusted manner.
arXiv Detail & Related papers (2023-10-17T17:52:27Z)
- Deep Pipeline Embeddings for AutoML [11.168121941015015]
AutoML is a promising direction for democratizing AI by automatically deploying Machine Learning systems with minimal human expertise.
Existing Pipeline Optimization techniques fail to explore deep interactions between pipeline stages/components.
This paper proposes a novel neural architecture that captures the deep interaction between the components of a Machine Learning pipeline.
arXiv Detail & Related papers (2023-05-23T12:40:38Z)
- Towards Personalized Preprocessing Pipeline Search [52.59156206880384]
ClusterP3S is a novel framework for Personalized Preprocessing Pipeline Search via Clustering.
We propose a hierarchical search strategy to jointly learn the clusters and search for the optimal pipelines.
Experiments on benchmark classification datasets demonstrate the effectiveness of enabling feature-wise preprocessing pipeline search.
arXiv Detail & Related papers (2023-02-28T05:45:05Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning
Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using a basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Automated Evolutionary Approach for the Design of Composite Machine
Learning Pipelines [48.7576911714538]
The proposed approach aims to automate the design of composite machine learning pipelines.
It designs the pipelines with a customizable graph-based structure, analyzes the obtained results, and reproduces them.
The software implementation of this approach is presented as an open-source framework.
arXiv Detail & Related papers (2021-06-26T23:19:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.