Auto-Validate by-History: Auto-Program Data Quality Constraints to
Validate Recurring Data Pipelines
- URL: http://arxiv.org/abs/2306.02421v1
- Date: Sun, 4 Jun 2023 17:53:30 GMT
- Title: Auto-Validate by-History: Auto-Program Data Quality Constraints to
Validate Recurring Data Pipelines
- Authors: Dezhan Tu, Yeye He, Weiwei Cui, Song Ge, Haidong Zhang, Han Shi,
Dongmei Zhang, Surajit Chaudhuri
- Abstract summary: Data pipelines are widely employed in modern enterprises to power a variety of Machine-Learning (ML) and Business-Intelligence (BI) applications.
Data quality (DQ) issues can often creep into recurring pipelines because of upstream schema and data drift over time.
We propose Auto-Validate-by-History (AVH), which can automatically detect DQ issues in recurring pipelines.
- Score: 41.39496264168388
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data pipelines are widely employed in modern enterprises to power a variety
of Machine-Learning (ML) and Business-Intelligence (BI) applications.
Crucially, these pipelines are \emph{recurring} (e.g., daily or hourly) in
production settings to keep data updated so that ML models can be re-trained
regularly, and BI dashboards refreshed frequently. However, data quality (DQ)
issues can often creep into recurring pipelines because of upstream schema and
data drift over time. As modern enterprises operate thousands of recurring
pipelines, today data engineers have to spend substantial efforts to
\emph{manually} monitor and resolve DQ issues, as part of their DataOps and
MLOps practices.
Given the high human cost of managing large-scale pipeline operations, it is
imperative that we can \emph{automate} as much as possible. In this work, we
propose Auto-Validate-by-History (AVH) that can automatically detect DQ issues
in recurring pipelines, leveraging rich statistics from historical executions.
We formalize this as an optimization problem, and develop constant-factor
approximation algorithms with provable precision guarantees. Extensive
evaluations using 2000 production data pipelines at Microsoft demonstrate the
effectiveness and efficiency of AVH.
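The core idea of leveraging historical statistics can be illustrated with a minimal sketch. This is not the paper's algorithm (AVH formalizes constraint selection as an optimization problem with precision guarantees); it only shows the simpler underlying intuition of flagging a run whose per-column metrics deviate from the history of past executions. All function and metric names here are hypothetical.

```python
from statistics import mean, stdev

def history_based_checks(history, today, k=3.0):
    """Flag metrics in today's run that deviate from historical executions.

    `history` is a list of per-run metric dicts (e.g. row count, null rate);
    `today` is the same dict for the current run.  A metric is flagged when it
    falls outside mean +/- k * stdev of its historical values.
    """
    violations = []
    for metric, value in today.items():
        past = [run[metric] for run in history if metric in run]
        if len(past) < 2:
            continue  # not enough history to form a constraint
        mu, sigma = mean(past), stdev(past)
        if abs(value - mu) > k * max(sigma, 1e-9):
            violations.append((metric, value, mu, sigma))
    return violations

# Example: 30 daily runs with ~1M rows each; today's run drops to 100K rows,
# which suggests an upstream data-drift issue.
history = [{"row_count": 1_000_000 + d * 1_000, "null_rate": 0.01} for d in range(30)]
today = {"row_count": 100_000, "null_rate": 0.01}
flagged = history_based_checks(history, today)
```

In practice a system like AVH must also decide *which* constraints to install per column so that alerts stay precise, which is what the paper's approximation algorithms address.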
Related papers
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
- Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.724842920942024]
Industries such as finance, meteorology, and energy generate vast amounts of data daily.
We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the procedures and pipelines needed to actually deploy machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script-language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- LogLAB: Attention-Based Labeling of Log Data Anomalies via Weak Supervision [63.08516384181491]
We present LogLAB, a novel modeling approach for automated labeling of log messages without requiring manual work by experts.
Our method relies on estimated failure time windows provided by monitoring systems to produce precise labeled datasets in retrospect.
Our evaluation shows that LogLAB consistently outperforms nine benchmark approaches across three different datasets and maintains an F1-score of more than 0.98 even at large failure time windows.
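The weak-supervision idea described above can be sketched in a few lines. This is only an illustration of the labeling step, not LogLAB's attention-based model: log lines whose timestamps fall inside an estimated failure window (as reported by monitoring) receive a weak anomaly label. All names here are hypothetical.

```python
from datetime import datetime, timedelta

def weak_label_logs(log_lines, failure_windows):
    """Assign a weak anomaly label to each timestamped log line.

    A line is labeled anomalous (1) if its timestamp falls inside any
    estimated failure time window reported by monitoring, else normal (0).
    """
    labeled = []
    for ts, message in log_lines:
        label = int(any(start <= ts <= end for start, end in failure_windows))
        labeled.append((ts, message, label))
    return labeled

# Six log lines, 10 seconds apart; monitoring reports one failure window.
base = datetime(2021, 11, 2, 15, 0)
logs = [(base + timedelta(seconds=s), f"msg {s}") for s in range(0, 60, 10)]
windows = [(base + timedelta(seconds=15), base + timedelta(seconds=35))]
labeled = weak_label_logs(logs, windows)
```

The resulting noisy labels can then serve as training data for a downstream anomaly-detection model, which is where LogLAB's contribution lies.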
arXiv Detail & Related papers (2021-11-02T15:16:08Z)
- AI Total: Analyzing Security ML Models with Imperfect Data in Production [2.629585075202626]
Development of new machine learning models is typically done on manually curated data sets.
We develop a web-based visualization system that allows the users to quickly gather headline performance numbers.
It also enables the users to immediately observe the root cause of an issue when something goes wrong.
arXiv Detail & Related papers (2021-10-13T20:56:05Z)
- Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes [16.392844962056742]
We develop a corpus-driven approach to auto-validate \emph{machine-generated} data by inferring suitable data-validation "patterns".
Part of this technology ships as the Auto-Tag feature in Microsoft Azure Purview.
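A toy version of pattern inference conveys the flavor of this approach. This sketch (not the paper's corpus-driven method, which selects patterns from a data lake to control false positives) generalizes example strings into a shared regex-like pattern and uses it to validate new values; all names are hypothetical.

```python
import re

def generalize(s):
    """Map a string to a pattern: digit runs -> \\d+, letter runs ->
    [A-Za-z]+, other characters escaped literally."""
    tokens = []
    for ch in s:
        if ch.isdigit():
            tok = r"\d"
        elif ch.isalpha():
            tok = "[A-Za-z]"
        else:
            tok = re.escape(ch)
        if tokens and tokens[-1] == tok + "+":
            continue          # already inside a run of this token class
        if tokens and tokens[-1] == tok:
            tokens[-1] = tok + "+"  # second occurrence starts a run
        else:
            tokens.append(tok)
    return "".join(tokens)

def infer_pattern(values):
    """Return the single shared pattern, or None if examples disagree."""
    patterns = {generalize(v) for v in values}
    return patterns.pop() if len(patterns) == 1 else None

# All three examples generalize to the same date-like pattern.
pattern = infer_pattern(["2023-06-04", "2021-11-02", "1999-01-31"])
ok = re.fullmatch(pattern, "2024-07-15") is not None
bad = re.fullmatch(pattern, "not-a-date") is not None
```

An inferred pattern like this can then act as a validation constraint: values that fail to match it are flagged as likely data-quality issues.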
arXiv Detail & Related papers (2021-04-10T01:15:48Z)
- AutoWeka4MCPS-AVATAR: Accelerating Automated Machine Learning Pipeline Composition and Optimisation [13.116806430326513]
We propose a novel method to evaluate the validity of ML pipelines, without executing them, using a surrogate model (AVATAR).
The AVATAR generates a knowledge base by automatically learning the capabilities and effects of ML algorithms on datasets' characteristics.
Instead of executing the original ML pipeline to evaluate its validity, the AVATAR evaluates its surrogate model constructed by capabilities and effects of the ML pipeline components.
arXiv Detail & Related papers (2020-11-21T14:05:49Z)
- MLCask: Efficient Management of Component Evolution in Collaborative Data Analytics Pipelines [29.999324319722508]
We identify two main challenges that arise during the deployment of machine learning pipelines, and address them through the design of versioning in an end-to-end analytics system, MLCask.
We define and accelerate the metric-driven merge operation by pruning the pipeline search tree using reusable history records and pipeline compatibility information.
The effectiveness of MLCask is evaluated through an extensive study over several real-world deployment cases.
arXiv Detail & Related papers (2020-10-17T13:34:48Z)
- AVATAR -- Machine Learning Pipeline Evaluation Using Surrogate Model [10.83607599315401]
We propose a novel method to evaluate the validity of ML pipelines using a surrogate model (AVATAR).
Our experiments show that the AVATAR is more efficient in evaluating complex pipelines in comparison with the traditional evaluation approaches requiring their execution.
arXiv Detail & Related papers (2020-01-30T02:53:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.