Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines"
- URL: http://arxiv.org/abs/2404.19591v1
- Date: Tue, 30 Apr 2024 14:36:04 GMT
- Title: Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines"
- Authors: Stefan Grafberger, Paul Groth, Sebastian Schelter
- Abstract summary: Data scientists develop ML pipelines in an iterative manner: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings.
We propose to support data scientists during this development cycle with automatically derived interactive suggestions for pipeline improvements.
We discuss our vision to generate these suggestions with so-called shadow pipelines, hidden variants of the original pipeline that modify it to auto-detect potential issues, try out modifications for improvements, and suggest and explain these modifications to the user.
We envision applying incremental view maintenance-based optimisations to ensure low-latency computation and maintenance of the shadow pipelines.
- Score: 13.559945160284876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data scientists develop ML pipelines in an iterative manner: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and error-prone. Therefore, we propose to support data scientists during this development cycle with automatically derived interactive suggestions for pipeline improvements. We discuss our vision to generate these suggestions with so-called shadow pipelines, hidden variants of the original pipeline that modify it to auto-detect potential issues, try out modifications for improvements, and suggest and explain these modifications to the user. We envision applying incremental view maintenance-based optimisations to ensure low-latency computation and maintenance of the shadow pipelines. We conduct preliminary experiments to showcase the feasibility of our envisioned approach and the potential benefits of our proposed optimisations.
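To make the idea concrete, here is a minimal, hypothetical sketch of a shadow pipeline in Python: a hidden clone of the user's scikit-learn pipeline swaps one data-preparation step, compares cross-validation scores, and surfaces the change as a suggestion. The helper `shadow_suggestion` and the imputation swap are illustrative assumptions, not the authors' implementation.

```python
# A minimal, hypothetical sketch of the "shadow pipeline" idea:
# clone the user's pipeline, modify one data-preparation step,
# and suggest the change if the shadow variant scores better.
import numpy as np
from sklearn.base import clone
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan  # inject missing values

user_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", LogisticRegression(max_iter=1000)),
])

def shadow_suggestion(pipeline, X, y):
    """Try a hidden variant with a different imputation strategy."""
    baseline = cross_val_score(pipeline, X, y, cv=3).mean()
    shadow = clone(pipeline).set_params(impute__strategy="median")
    candidate = cross_val_score(shadow, X, y, cv=3).mean()
    if candidate > baseline:
        return (f"Suggestion: switch imputation to 'median' "
                f"({baseline:.3f} -> {candidate:.3f} CV accuracy)")
    return "No improvement found by this shadow variant."

print(shadow_suggestion(user_pipeline, X, y))
```

In the envisioned system such variants would run hidden in the background and be kept fresh incrementally; here the comparison is recomputed from scratch for simplicity.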
Related papers
- Trusted Provenance of Automated, Collaborative and Adaptive Data Processing Pipelines [2.186901738997927]
We provide a solution architecture and a proof-of-concept implementation of a service called Provenance Holder.
The Provenance Holder enables trusted provenance of collaborative, adaptive data processing pipelines.
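As a rough illustration of what trusted provenance could look like, the sketch below hash-chains provenance records so that later tampering is detectable. The `ProvenanceHolder` class here is a hypothetical stand-in for the paper's service, not its actual design.

```python
# Hypothetical sketch: a hash-chained log that makes pipeline
# provenance tamper-evident (a stand-in for the paper's service).
import hashlib, json, time

class ProvenanceHolder:
    def __init__(self):
        self.records = []

    def record(self, step_name, params):
        prev = self.records[-1]["hash"] if self.records else "genesis"
        body = {"step": step_name, "params": params,
                "time": time.time(), "prev": prev}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.records.append({**body, "hash": digest})

    def verify(self):
        prev = "genesis"
        for rec in self.records:
            body = {k: rec[k] for k in ("step", "params", "time", "prev")}
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if rec["prev"] != prev or rec["hash"] != digest:
                return False
            prev = rec["hash"]
        return True

holder = ProvenanceHolder()
holder.record("load", {"source": "orders.csv"})
holder.record("clean", {"dropna": True})
print(holder.verify())  # True unless a record was altered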
arXiv Detail & Related papers (2023-10-17T17:52:27Z) - End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures.
We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
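For context, the sketch below implements the classic hand-designed expected-improvement acquisition function that such a learned, transformer-based acquisition would replace; it is a standard Bayesian-optimisation formula, not the paper's method.

```python
# Classic expected-improvement acquisition for a Gaussian posterior;
# the paper proposes learning this function end to end instead.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far):
    """EI for minimisation, given posterior mean/std at candidates."""
    sigma = np.maximum(sigma, 1e-12)  # avoid division by zero
    z = (best_so_far - mu) / sigma
    return (best_so_far - mu) * norm.cdf(z) + sigma * norm.pdf(z)

mu = np.array([0.2, 0.5, 0.1])
sigma = np.array([0.3, 0.1, 0.05])
print(expected_improvement(mu, sigma, best_so_far=0.25))
```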
arXiv Detail & Related papers (2023-05-25T10:58:46Z) - Deep Pipeline Embeddings for AutoML [11.168121941015015]
AutoML is a promising direction for democratizing AI by automatically deploying Machine Learning systems with minimal human expertise.
Existing Pipeline Optimization techniques fail to explore deep interactions between pipeline stages/components.
This paper proposes a novel neural architecture that captures the deep interaction between the components of a Machine Learning pipeline.
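A toy sketch of the general idea, under the assumption that each component gets a learned embedding and adjacent components contribute interaction terms; the `EMB` table and the combination scheme are illustrative, not the paper's architecture.

```python
# Toy sketch: embed pipeline components and represent the pipeline
# through their interactions; illustrative only, not the paper's model.
import numpy as np

rng = np.random.default_rng(0)
COMPONENTS = ["impute", "scale", "pca", "svm", "forest"]
EMB = {c: rng.normal(size=8) for c in COMPONENTS}  # learned in practice

def pipeline_embedding(steps):
    # Sum of per-component embeddings plus elementwise interaction
    # terms between adjacent stages, so the representation captures
    # more than a bag of components.
    vecs = [EMB[s] for s in steps]
    pairwise = sum(vecs[i] * vecs[i + 1] for i in range(len(vecs) - 1))
    return np.concatenate([sum(vecs), pairwise])

print(pipeline_embedding(["impute", "scale", "svm"]).shape)  # (16,)
```

A downstream regressor over such embeddings could then predict pipeline performance, which is the role the paper's neural architecture plays.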
arXiv Detail & Related papers (2023-05-23T12:40:38Z) - End-to-End Diffusion Latent Optimization Improves Classifier Guidance [81.27364542975235]
Direct Optimization of Diffusion Latents (DOODL) is a novel guidance method.
It enables plug-and-play guidance by optimizing diffusion latents.
It outperforms one-step classifier guidance on computational and human evaluation metrics.
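A toy stand-in for this kind of latent optimisation: gradient-ascend a latent through a fixed differentiable decoder to raise a frozen classifier's score. DOODL does this through a full diffusion sampler; the two linear layers here are placeholders.

```python
# Toy stand-in for latent optimisation: raise a frozen classifier's
# score by gradient steps on the latent, not on any model weights.
import torch

torch.manual_seed(0)
decoder = torch.nn.Linear(4, 16)     # stand-in for the generative model
classifier = torch.nn.Linear(16, 1)  # stand-in for the guidance model
for p in list(decoder.parameters()) + list(classifier.parameters()):
    p.requires_grad_(False)          # both models stay frozen

latent = torch.randn(4, requires_grad=True)
opt = torch.optim.Adam([latent], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = -classifier(decoder(latent)).sum()  # maximise the score
    loss.backward()
    opt.step()
print(classifier(decoder(latent)).item())
```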
arXiv Detail & Related papers (2023-03-23T22:43:52Z) - Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
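One standard remedy in this space is to overlap preprocessing with training; the sketch below does so with a bounded prefetch queue and a background thread. The sleep times and prefetch depth are illustrative, not the paper's tuned settings.

```python
# Minimal sketch: overlap preprocessing with training by prefetching
# batches in a background thread (one standard throughput remedy).
import queue, threading, time

def preprocess(i):
    time.sleep(0.01)  # stand-in for decoding/augmentation work
    return f"batch-{i}"

def producer(q, n_batches):
    for i in range(n_batches):
        q.put(preprocess(i))
    q.put(None)  # sentinel: no more batches

q = queue.Queue(maxsize=8)  # bounded buffer = prefetch depth
threading.Thread(target=producer, args=(q, 20), daemon=True).start()

while (batch := q.get()) is not None:
    time.sleep(0.01)  # stand-in for a training step on `batch`
    # the producer keeps preparing upcoming batches in parallel
```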
arXiv Detail & Related papers (2022-02-17T14:31:58Z) - Controllable Data Augmentation Through Deep Relighting [75.96144853354362]
We explore how to augment a varied set of image datasets through relighting in order to make existing models more invariant to illumination changes.
We develop a tool, based on an encoder-decoder network, that is able to quickly generate multiple variations of the illumination of various input scenes.
We demonstrate that by training models on datasets that have been augmented with our pipeline, it is possible to achieve higher performance on localization benchmarks.
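The paper's relighting tool is a learned encoder-decoder; as a crude, hypothetical stand-in, the sketch below applies random gain and gamma perturbations to show where such augmentation slots into a training pipeline.

```python
# Crude photometric stand-in for learned relighting: the paper's
# encoder-decoder would replace `relight` with realistic variants.
import numpy as np

def relight(image, gain, gamma):
    """Scale brightness and adjust gamma; image values in [0, 1]."""
    return np.clip(gain * np.power(image, gamma), 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
augmented = [relight(image, gain=rng.uniform(0.6, 1.4),
                     gamma=rng.uniform(0.7, 1.5)) for _ in range(4)]
print(len(augmented), augmented[0].shape)  # 4 (64, 64, 3)
```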
arXiv Detail & Related papers (2021-10-26T20:02:51Z) - AutoWeka4MCPS-AVATAR: Accelerating Automated Machine Learning Pipeline Composition and Optimisation [13.116806430326513]
We propose a novel method to evaluate the validity of ML pipelines, without their execution, using a surrogate model (AVATAR).
The AVATAR generates a knowledge base by automatically learning the capabilities and effects of ML algorithms on datasets' characteristics.
Instead of executing the original ML pipeline to evaluate its validity, the AVATAR evaluates its surrogate model constructed by capabilities and effects of the ML pipeline components.
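A minimal sketch of this surrogate-style validity check: each component declares which data characteristics it cannot handle and which it removes, and validity is decided by chaining these declared effects without ever executing the pipeline. The capability tables below are hypothetical, not AVATAR's knowledge base.

```python
# Minimal sketch of surrogate validity checking: chain declared
# component capabilities/effects instead of executing the pipeline.
CAPABILITIES = {
    # component: (characteristics it cannot handle, characteristics it removes)
    "imputer":    (set(),                             {"missing_values"}),
    "one_hot":    ({"missing_values"},                {"categorical"}),
    "linear_svm": ({"missing_values", "categorical"}, set()),
}

def is_valid(pipeline, data_characteristics):
    state = set(data_characteristics)
    for comp in pipeline:
        forbidden, removes = CAPABILITIES[comp]
        if state & forbidden:  # component cannot handle this state
            return False
        state -= removes       # declared effect of the component
    return True

data = {"missing_values", "categorical"}
print(is_valid(["imputer", "one_hot", "linear_svm"], data))  # True
print(is_valid(["one_hot", "linear_svm"], data))             # False
```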
arXiv Detail & Related papers (2020-11-21T14:05:49Z) - Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics.
We prove that our model-based procedure converges in the noisy gradient setting.
This is a promising step toward constructing self-tuning quadratics.
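The core primitive here is an exact Hessian-vector product, which can be computed by double backpropagation; below is a minimal PyTorch sketch of that primitive, illustrative rather than the paper's full procedure.

```python
# Exact Hessian-vector product via double backprop, the primitive
# such self-tuning quadratic models are built on.
import torch

def hessian_vector_product(loss_fn, params, vec):
    loss = loss_fn(params)
    (grad,) = torch.autograd.grad(loss, params, create_graph=True)
    (hvp,) = torch.autograd.grad(grad @ vec, params)
    return hvp

params = torch.tensor([1.0, 2.0], requires_grad=True)
loss_fn = lambda p: (p ** 2).sum() + p[0] * p[1]  # Hessian [[2,1],[1,2]]
vec = torch.tensor([1.0, 0.0])
print(hessian_vector_product(loss_fn, params, vec))  # tensor([2., 1.])
```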
arXiv Detail & Related papers (2020-11-09T22:07:30Z) - Stochastic Optimization with Laggard Data Pipelines [65.20044914532221]
We show that "dataechoed" extensions of common optimization methods exhibit provable improvements over their synchronous counterparts.
Specifically, we show that in convex optimization with minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
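A minimal sketch of data echoing: each batch from a slow input pipeline is reused for several optimisation steps instead of waiting for the next batch. The echo factor and the linear-regression toy problem are illustrative.

```python
# Minimal sketch of data echoing: reuse each (slow-to-produce)
# batch for several SGD steps instead of waiting for the next one.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)
ECHO_FACTOR = 4  # optimisation steps per fetched batch (illustrative)

def slow_pipeline():
    X = rng.normal(size=(32, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=32)
    return X, y

for _ in range(50):               # 50 fetches, 200 SGD steps in total
    X, y = slow_pipeline()
    for _ in range(ECHO_FACTOR):  # "echo" the batch
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= 0.05 * grad
print(w)  # approaches [1, -2, 0.5]
```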
arXiv Detail & Related papers (2020-10-26T14:55:31Z) - AVATAR -- Machine Learning Pipeline Evaluation Using Surrogate Model [10.83607599315401]
We propose a novel method to evaluate the validity of ML pipelines using a surrogate model (AVATAR).
Our experiments show that the AVATAR is more efficient in evaluating complex pipelines in comparison with the traditional evaluation approaches requiring their execution.
arXiv Detail & Related papers (2020-01-30T02:53:29Z)