Trusted Provenance of Automated, Collaborative and Adaptive Data Processing Pipelines
- URL: http://arxiv.org/abs/2310.11442v1
- Date: Tue, 17 Oct 2023 17:52:27 GMT
- Title: Trusted Provenance of Automated, Collaborative and Adaptive Data Processing Pipelines
- Authors: Ludwig Stage, Dimka Karastoyanova,
- Abstract summary: We provide a solution architecture and a proof of concept implementation of a service, called Provenance Holder.
Provenance Holder enables provenance of collaborative, adaptive data processing pipelines in a trusted manner.
- Score: 2.186901738997927
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: To benefit from the abundance of data and the insights it brings data processing pipelines are being used in many areas of research and development in both industry and academia. One approach to automating data processing pipelines is the workflow technology, as it also supports collaborative, trial-and-error experimentation with the pipeline architecture in different application domains. In addition to the necessary flexibility that such pipelines need to possess, in collaborative settings cross-organisational interactions are plagued by lack of trust. While capturing provenance information related to the pipeline execution and the processed data is a first step towards enabling trusted collaborations, the current solutions do not allow for provenance of the change in the processing pipelines, where the subject of change can be made on any aspect of the workflow implementing the pipeline and on the data used while the pipeline is being executed. Therefore in this work we provide a solution architecture and a proof of concept implementation of a service, called Provenance Holder, which enable provenance of collaborative, adaptive data processing pipelines in a trusted manner. We also contribute a definition of a set of properties of such a service and identify future research directions.
Related papers
- Uncovering communities of pipelines in the task-fMRI analytical space [0.9217021281095907]
We show that there are subsets of pipelines that give similar results, especially those sharing specific parameters.
Those pipeline-to-pipeline patterns are stable across groups of participants but not across different tasks.
By visualizing the differences between communities, we show that the pipeline space is mainly driven by the size of the activation area in the brain.
arXiv Detail & Related papers (2023-12-11T09:18:14Z) - Deep Pipeline Embeddings for AutoML [11.168121941015015]
AutoML is a promising direction for democratizing AI by automatically deploying Machine Learning systems with minimal human expertise.
Existing Pipeline Optimization techniques fail to explore deep interactions between pipeline stages/components.
This paper proposes a novel neural architecture that captures the deep interaction between the components of a Machine Learning pipeline.
arXiv Detail & Related papers (2023-05-23T12:40:38Z) - Integrating pre-processing pipelines in ODC based framework [0.0]
This paper proposes a method to integrate virtual products based on integrating open-source processing pipelines.
In order to validate and evaluate the functioning of this approach, we have integrated it into a geo-imagery management framework.
arXiv Detail & Related papers (2022-10-04T11:12:09Z) - Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data
Programming [77.38174112525168]
We present Nemo, an end-to-end interactive Supervision system that improves overall productivity of WS learning pipeline by an average 20% (and up to 47% in one task) compared to the prevailing WS supervision approach.
arXiv Detail & Related papers (2022-03-02T19:57:32Z) - Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning
Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - Automated Evolutionary Approach for the Design of Composite Machine
Learning Pipelines [48.7576911714538]
The proposed approach is aimed to automate the design of composite machine learning pipelines.
It designs the pipelines with a customizable graph-based structure, analyzes the obtained results, and reproduces them.
The software implementation on this approach is presented as an open-source framework.
arXiv Detail & Related papers (2021-06-26T23:19:06Z) - MLCask: Efficient Management of Component Evolution in Collaborative
Data Analytics Pipelines [29.999324319722508]
We address two main challenges that arise during the deployment of machine learning pipelines, and address them with the design of versioning for an end-to-end analytics system MLCask.
We define and accelerate the metric-driven merge operation by pruning the pipeline search tree using reusable history records and pipeline compatibility information.
The effectiveness of MLCask is evaluated through an extensive study over several real-world deployment cases.
arXiv Detail & Related papers (2020-10-17T13:34:48Z) - Petri Nets with Parameterised Data: Modelling and Verification (Extended
Version) [67.99023219822564]
We introduce and study an extension of coloured Petri nets, called catalog-nets, providing two key features to capture this type of processes.
We show that fresh-value injection is a particularly complex feature to handle, and discuss strategies to tame it.
arXiv Detail & Related papers (2020-06-11T17:26:08Z) - Rethinking Learning-based Demosaicing, Denoising, and Super-Resolution
Pipeline [86.01209981642005]
We study the effects of pipelines on the mixture problem of learning-based DN, DM, and SR, in both sequential and joint solutions.
Our suggested pipeline DN$to$SR$to$DM yields consistently better performance than other sequential pipelines.
We propose an end-to-end Trinity Pixel Enhancement NETwork (TENet) that achieves state-of-the-art performance for the mixture problem.
arXiv Detail & Related papers (2019-05-07T13:19:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.