Production Machine Learning Pipelines: Empirical Analysis and
Optimization Opportunities
- URL: http://arxiv.org/abs/2103.16007v1
- Date: Tue, 30 Mar 2021 00:46:29 GMT
- Title: Production Machine Learning Pipelines: Empirical Analysis and
Optimization Opportunities
- Authors: Doris Xin, Hui Miao, Aditya Parameswaran, Neoklis Polyzotis
- Abstract summary: We analyze provenance graphs of 3000 production ML pipelines at Google, comprising over 450,000 models trained, spanning a period of over four months.
Our analysis reveals the characteristics, components, and topologies of typical industry-strength ML pipelines at various granularities.
We identify several rich opportunities for optimization, leveraging traditional data management ideas.
- Score: 5.510431861706128
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning (ML) is now commonplace, powering data-driven applications
in various organizations. Unlike the traditional perception of ML in research,
ML production pipelines are complex, with many interlocking analytical
components beyond training, whose sub-parts are often run multiple times on
overlapping subsets of data. However, there is a lack of quantitative evidence
regarding the lifespan, architecture, frequency, and complexity of these
pipelines to understand how data management research can be used to make them
more efficient, effective, robust, and reproducible. To that end, we analyze
the provenance graphs of 3000 production ML pipelines at Google, comprising
over 450,000 models trained, spanning a period of over four months, in an
effort to understand the complexity and challenges underlying production ML.
Our analysis reveals the characteristics, components, and topologies of typical
industry-strength ML pipelines at various granularities. Along the way, we
introduce a specialized data model for representing and reasoning about
repeatedly run components in these ML pipelines, which we call model graphlets.
We identify several rich opportunities for optimization, leveraging traditional
data management ideas. We show how targeting even one of these opportunities,
i.e., identifying and pruning wasted computation that does not translate to
model deployment, can reduce wasted computation cost by 50% without
compromising the model deployment cadence.
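The pruning opportunity described above can be sketched as a reachability check over the provenance graph: any training run whose outputs never reach a deployed model is wasted work. The graph encoding, node names, and `prunable_nodes` helper below are illustrative assumptions for intuition only, not the paper's actual data model or algorithm.

```python
# Hypothetical sketch: finding computation that never reaches deployment.
def prunable_nodes(edges, deployed):
    """Return provenance-graph nodes with no path to any deployed model.

    edges: dict mapping node -> list of downstream nodes
    deployed: set of nodes representing deployed models
    """
    # Invert the graph so we can walk backwards from deployed nodes.
    parents = {}
    for src, dsts in edges.items():
        for dst in dsts:
            parents.setdefault(dst, []).append(src)
    # Everything reachable backwards from a deployment is useful.
    useful, stack = set(deployed), list(deployed)
    while stack:
        node = stack.pop()
        for p in parents.get(node, []):
            if p not in useful:
                useful.add(p)
                stack.append(p)
    # The rest is candidate wasted computation.
    return {n for n in edges if n not in useful}

# Example: training t1 feeds deployed model m1; t2's model m2 never shipped.
edges = {"t1": ["m1"], "t2": ["m2"], "m1": [], "m2": []}
print(prunable_nodes(edges, {"m1"}))  # {'t2', 'm2'}
```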
Related papers
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]

Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- Limits of Transformer Language Models on Learning to Compose Algorithms [77.2443883991608]
We evaluate training LLaMA models and prompting GPT-4 and Gemini on four tasks that require learning a composition of several discrete sub-tasks.
Our results indicate that compositional learning in state-of-the-art Transformer language models is highly sample inefficient.
arXiv Detail & Related papers (2024-02-08T16:23:29Z)
- Benchmarking Automated Machine Learning Methods for Price Forecasting Applications [58.720142291102135]
We show the possibility of substituting manually created ML pipelines with automated machine learning (AutoML) solutions.
Based on the CRISP-DM process, we split the manual ML pipeline into a machine learning and non-machine learning part.
In a case study on the industrial use case of price forecasting, we show that domain knowledge combined with AutoML can reduce the dependence on ML experts.
arXiv Detail & Related papers (2023-04-28T10:27:38Z)
- Analytical Engines With Context-Rich Processing: Towards Efficient Next-Generation Analytics [12.317930859033149]
We envision an analytical engine co-optimized with components that enable context-rich analysis.
We aim for a holistic pipeline cost- and rule-based optimization across relational and model-based operators.
arXiv Detail & Related papers (2022-12-14T21:46:33Z)
- Modeling Quality and Machine Learning Pipelines through Extended Feature Models [0.0]
We propose a new engineering approach for quality ML pipelines based on a proper extension of the Feature Models meta-model.
The presented approach makes it possible to model ML pipelines, their quality requirements (on the whole pipeline and on single phases), and the quality characteristics of the algorithms used to implement each pipeline phase.
arXiv Detail & Related papers (2022-07-15T15:20:28Z)
- Data Debugging with Shapley Importance over End-to-End Machine Learning Pipelines [27.461398584509755]
DataScope is the first system that efficiently computes Shapley values of training examples over an end-to-end machine learning pipeline.
Our results show that DataScope is up to four orders of magnitude faster than state-of-the-art Monte Carlo-based methods.
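For intuition, the Monte Carlo baseline that DataScope is compared against can be sketched in a few lines: sample random permutations of the training set and average each example's marginal contribution. The `utility` callback (scoring a subset of training-example indices) is an assumed interface here; this is not DataScope's own algorithm.

```python
# Monte Carlo estimate of Shapley importance for n training examples.
import random

def shapley_monte_carlo(n, utility, rounds=200, seed=0):
    rng = random.Random(seed)
    values = [0.0] * n
    for _ in range(rounds):
        perm = rng.sample(range(n), n)  # one random permutation
        subset, prev = [], utility([])
        for i in perm:
            subset.append(i)
            cur = utility(subset)
            values[i] += cur - prev  # marginal contribution of example i
            prev = cur
    return [v / rounds for v in values]
```

With a utility such as validation accuracy of a model trained on the subset, examples whose inclusion consistently raises the score receive high Shapley values; the exhaustive version requires retraining per permutation step, which is why efficient systems like DataScope matter.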
arXiv Detail & Related papers (2022-04-23T19:29:23Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Basic cross-platform tensor frameworks and script-language engines alone do not supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all such requirements while still building on these basic engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- A Rigorous Machine Learning Analysis Pipeline for Biomedical Binary Classification: Application in Pancreatic Cancer Nested Case-control Studies with Implications for Bias Assessments [2.9726886415710276]
We have laid out and assembled a complete, rigorous ML analysis pipeline focused on binary classification.
This 'automated' but customizable pipeline includes a) exploratory analysis, b) data cleaning and transformation, c) feature selection, d) model training with 9 established ML algorithms.
We apply this pipeline to an epidemiological investigation of established and newly identified risk factors for cancer to evaluate how different sources of bias might be handled by ML algorithms.
arXiv Detail & Related papers (2020-08-28T19:58:05Z)
- PipeSim: Trace-driven Simulation of Large-Scale AI Operations Platforms [4.060731229044571]
We present a trace-driven simulation-based experimentation and analytics environment for large-scale AI systems.
Analytics data from a production-grade AI platform developed at IBM are used to build a comprehensive simulation model.
We implement the model in a standalone, discrete event simulator, and provide a toolkit for running experiments.
arXiv Detail & Related papers (2020-06-22T19:55:37Z)
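A discrete-event simulator of the kind PipeSim builds on can be sketched minimally as a time-ordered event queue whose handlers schedule further events. The event kinds, handler signature, and job names below are illustrative assumptions, not PipeSim's actual model or toolkit.

```python
# Minimal discrete-event simulation loop over a priority queue of events.
import heapq

def simulate(initial_events, handlers, until=float("inf")):
    """Run events in timestamp order; handlers may schedule new events."""
    queue, counter = [], 0

    def schedule(time, kind, payload=None):
        nonlocal counter
        # The counter breaks ties so payloads are never compared directly.
        heapq.heappush(queue, (time, counter, kind, payload))
        counter += 1

    for t, k, p in initial_events:
        schedule(t, k, p)

    log = []
    while queue:
        time, _, kind, payload = heapq.heappop(queue)
        if time > until:
            break
        log.append((time, kind))
        handlers[kind](time, payload, schedule)
    return log

# Example: a training job arrives at t=0 and finishes 5 time units later.
handlers = {
    "arrive": lambda t, p, sched: sched(t + 5, "finish", p),
    "finish": lambda t, p, sched: None,
}
print(simulate([(0, "arrive", "job-1")], handlers))
# [(0, 'arrive'), (5, 'finish')]
```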
This list is automatically generated from the titles and abstracts of the papers on this site.