Related papers: Instrumentation and Analysis of Native ML Pipelines via Logical Query Plans

URL: http://arxiv.org/abs/2407.07560v1
Date: Wed, 10 Jul 2024 11:35:02 GMT
Title: Instrumentation and Analysis of Native ML Pipelines via Logical Query Plans
Authors: Stefan Grafberger
Abstract summary: We envision highly-automated software platforms to assist data scientists with developing, validating, monitoring, and analysing their Machine Learning pipelines. We extract "logical query plans" from ML pipeline code relying on popular libraries. Based on these plans, we automatically infer pipeline semantics and instrument and rewrite the ML pipelines to enable diverse use cases without requiring data scientists to manually annotate or rewrite their code.
Score: 3.2362171533623054
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Machine Learning (ML) is increasingly used to automate impactful decisions, which leads to concerns regarding their correctness, reliability, and fairness. We envision highly-automated software platforms to assist data scientists with developing, validating, monitoring, and analysing their ML pipelines. In contrast to existing work, our key idea is to extract "logical query plans" from ML pipeline code relying on popular libraries. Based on these plans, we automatically infer pipeline semantics and instrument and rewrite the ML pipelines to enable diverse use cases without requiring data scientists to manually annotate or rewrite their code. First, we developed such an abstract ML pipeline representation together with machinery to extract it from Python code. Next, we used this representation to efficiently instrument static ML pipelines and apply provenance tracking, which enables lightweight screening for common data preparation issues. Finally, we built machinery to automatically rewrite ML pipelines to perform more advanced what-if analyses and proposed using multi-query optimisation for the resulting workloads. In future work, we aim to interactively assist data scientists as they work on their ML pipelines.
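Illustration: the abstract does not say how the plan extraction is implemented, so the following is only a toy sketch of the general idea, assuming a purely static approach built on Python's standard `ast` module. The `OPERATORS` table, the operator names, the `extract_plan` helper, and the sample pipeline are all hypothetical and are not taken from the paper. The sketch parses pipeline source code without running it and maps recognised library calls to relational-style operators.

```python
import ast

# Hypothetical mapping from library method names to plan operators.
# Illustrative assumption only; this is not the paper's operator set.
OPERATORS = {
    "read_csv": "DataSource",
    "merge": "Join",
    "dropna": "Selection",
    "fit": "Estimator",
}


def extract_plan(source: str) -> list[tuple[int, str, str]]:
    """Return (line, operator, call name) for each recognised call, in line order."""
    plan = []
    for node in ast.walk(ast.parse(source)):
        # Only method-style calls such as `df.merge(...)` are considered.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            name = node.func.attr
            if name in OPERATORS:
                plan.append((node.lineno, OPERATORS[name], name))
    return sorted(plan)


# Sample pipeline source; it is parsed, never executed, so pandas and
# scikit-learn do not need to be installed to run this sketch.
PIPELINE = """\
import pandas as pd
from sklearn.linear_model import LogisticRegression

patients = pd.read_csv("patients.csv")
histories = pd.read_csv("histories.csv")
data = patients.merge(histories, on="ssn")
data = data.dropna()
model = LogisticRegression().fit(data[["age"]], data["label"])
"""

plan = extract_plan(PIPELINE)
for line, operator, call in plan:
    print(f"line {line}: {operator} ({call})")
```

A flat, line-ordered list is the simplest possible stand-in for a plan; a fuller representation would also record which operator consumes which output, which this sketch does not attempt.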

Related papers

stratum: A System Infrastructure for Massive Agent-Centric ML Workloads [8.123450153690424]
Large language models (LLMs) generate, validate, and optimize complete machine learning (ML) pipelines. The existing Python-based ML ecosystem is built around libraries such as Pandas and scikit-learn. We propose stratum, a unified system infrastructure that decouples pipeline execution from planning and reasoning.
arXiv Detail & Related papers (2026-03-03T23:43:12Z)
SemPipes -- Optimizable Semantic Data Operators for Tabular Machine Learning Pipelines [12.816711873869984]
We introduce SemPipes, a novel declarative programming model that integrates semantic data operators into ML pipelines. SemPipes synthesizes custom operator implementations based on data characteristics, operator instructions, and pipeline context. We show that semantic operators substantially improve end-to-end predictive performance for both expert-designed and agent-generated pipelines.
arXiv Detail & Related papers (2026-02-04T23:36:29Z)
ExeKGLib: A Platform for Machine Learning Analytics based on Knowledge Graphs [6.611237989022405]
We present ExeKGLib, a Python library enhanced with a graphical interface layer that allows users with minimal ML knowledge to build ML pipelines. This is achieved by relying on knowledge graphs that encode ML knowledge in simple terms to non-ML experts.
arXiv Detail & Related papers (2025-08-01T07:45:49Z)
AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML [56.565200973244146]
Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline. Recent works have started exploiting large language models (LLM) to lessen such burden. This paper proposes AutoML-Agent, a novel multi-agent framework tailored for full-pipeline AutoML.
arXiv Detail & Related papers (2024-10-03T20:01:09Z)
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering. Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications. These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
Closing the loop: Autonomous experiments enabled by machine-learning-based online data analysis in synchrotron beamline environments [80.49514665620008]
Machine learning can be used to enhance research involving large or rapidly generated datasets. In this study, we describe the incorporation of ML into a closed-loop workflow for X-ray reflectometry (XRR). We present solutions that provide an elementary data analysis in real time during the experiment without introducing additional software dependencies in the beamline control software environment.
arXiv Detail & Related papers (2023-06-20T21:21:19Z)
Deep Pipeline Embeddings for AutoML [11.168121941015015]
AutoML is a promising direction for democratizing AI by automatically deploying Machine Learning systems with minimal human expertise. Existing Pipeline Optimization techniques fail to explore deep interactions between pipeline stages/components. This paper proposes a novel neural architecture that captures the deep interaction between the components of a Machine Learning pipeline.
arXiv Detail & Related papers (2023-05-23T12:40:38Z)
Benchmarking Automated Machine Learning Methods for Price Forecasting Applications [58.720142291102135]
We show the possibility of substituting manually created ML pipelines with automated machine learning (AutoML) solutions. Based on the CRISP-DM process, we split the manual ML pipeline into a machine learning and a non-machine learning part. We show in a case study for the industrial use case of price forecasting that domain knowledge combined with AutoML can weaken the dependence on ML experts.
arXiv Detail & Related papers (2023-04-28T10:27:38Z)
XAutoML: A Visual Analytics Tool for Understanding and Validating Automated Machine Learning [5.633209323925663]
XAutoML is an interactive visual analytics tool for explaining arbitrary AutoML optimization procedures and ML pipelines constructed by AutoML. XAutoML combines interactive visualizations with established techniques from explainable artificial intelligence (XAI) to make the complete AutoML procedure transparent and explainable.
arXiv Detail & Related papers (2022-02-24T08:18:25Z)
SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions [28.718446733713183]
We propose an AutoML technique, SapientML, that can learn from a corpus of existing datasets and their human-written pipelines. We have created a training corpus of 1094 pipelines spanning 170 datasets, and evaluated SapientML on a set of 41 benchmark datasets. Our evaluation shows that SapientML produces the best or comparable accuracy on 27 of the benchmarks while the second best tool fails to even produce a pipeline on 9 of the instances.
arXiv Detail & Related papers (2022-02-18T20:45:47Z)
SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines. This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines [7.022239953701528]
We propose Plumber, a tool for finding bottlenecks in Machine Learning (ML) input pipelines. Across five representative ML pipelines, Plumber obtains speedups of up to 46x for pipelines. By automating caching, Plumber obtains end-to-end speedups of over 40% compared to state-of-the-art tuners.
arXiv Detail & Related papers (2021-11-07T17:15:57Z)
AutoWeka4MCPS-AVATAR: Accelerating Automated Machine Learning Pipeline Composition and Optimisation [13.116806430326513]
We propose a novel method to evaluate the validity of ML pipelines, without their execution, using a surrogate model (AVATAR). The AVATAR generates a knowledge base by automatically learning the capabilities and effects of ML algorithms on datasets' characteristics. Instead of executing the original ML pipeline to evaluate its validity, the AVATAR evaluates its surrogate model constructed by capabilities and effects of the ML pipeline components.
arXiv Detail & Related papers (2020-11-21T14:05:49Z)
A DICOM Framework for Machine Learning Pipelines against Real-Time Radiology Images [50.222197963803644]
Niffler is an integrated framework that enables the execution of machine learning pipelines at research clusters. Niffler uses the Digital Imaging and Communications in Medicine (DICOM) protocol to fetch and store imaging data. We present its architecture and three of its use cases: an inferior vena cava filter detection from the images in real-time, identification of scanner utilization, and scanner clock calibration.
arXiv Detail & Related papers (2020-04-16T21:06:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.