Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes
- URL: http://arxiv.org/abs/2104.04659v2
- Date: Tue, 13 Apr 2021 17:29:18 GMT
- Title: Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes
- Authors: Jie Song, Yeye He
- Abstract summary: We develop a corpus-driven approach to auto-validate machine-generated data by inferring suitable data-validation "patterns"
Part of this technology ships as an Auto-Tag feature in Microsoft Azure Purview.
- Score: 16.392844962056742
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Complex data pipelines are increasingly common in diverse applications such
as BI reporting and ML modeling. These pipelines often recur regularly (e.g.,
daily or weekly), as BI reports need to be refreshed, and ML models need to be
retrained. However, it is widely reported that in complex production pipelines,
upstream data feeds can change in unexpected ways, causing downstream
applications to fail silently, which is expensive to resolve.
Data validation has thus become an important topic, as evidenced by notable
recent efforts from Google and Amazon, where the objective is to catch data
quality issues early as they arise in the pipelines. Our experience on
production data suggests, however, that on string-valued data, these existing
approaches yield high false-positive rates and frequently require human
intervention. In this work, we develop a corpus-driven approach to
auto-validate machine-generated data by inferring suitable
data-validation "patterns" that accurately describe the underlying data domain,
which minimizes false positives while maximizing data quality issues caught.
Evaluations using production data from real data lakes suggest that
Auto-Validate is substantially more effective than existing methods. Part of
this technology ships as an Auto-Tag feature in Microsoft Azure Purview.
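The inferred data-validation "patterns" can be thought of as syntactic templates over string values, learned from examples and used to flag values that deviate from the column's structure. A minimal sketch of that idea in Python, assuming a simple token-level generalization (an illustration of pattern-based validation, not the paper's actual inference algorithm):

```python
import re

# Hypothetical sketch: tokenize each string into runs of digits, letters,
# and single punctuation characters, generalize each token into a regex
# fragment, and accept a pattern only if all examples share one structure.

TOKEN_RE = re.compile(r"\d+|[A-Za-z]+|.")

def tokenize(value: str):
    """Split a string into digit runs, letter runs, and punctuation."""
    return TOKEN_RE.findall(value)

def generalize(token: str) -> str:
    """Map one token to a regex fragment of the same shape and length."""
    if token.isdigit():
        return rf"\d{{{len(token)}}}"
    if token.isalpha():
        return rf"[A-Za-z]{{{len(token)}}}"
    return re.escape(token)

def infer_pattern(values):
    """Infer a single regex shared by all example values.
    Returns None if the values do not share one token structure."""
    patterns = {"".join(generalize(t) for t in tokenize(v)) for v in values}
    return patterns.pop() if len(patterns) == 1 else None

def validates(pattern: str, value: str) -> bool:
    """Check whether a new value conforms to the inferred pattern."""
    return re.fullmatch(pattern, value) is not None

# Example: machine-generated dates in a column share one structure.
history = ["2021-04-13", "2021-04-14", "2021-04-15"]
pattern = infer_pattern(history)          # one pattern for YYYY-MM-DD shapes
ok = validates(pattern, "2021-04-16")     # True
bad = validates(pattern, "Apr 16, 2021")  # False: flag as a data-quality issue
```

Requiring a single shared structure across the example corpus is what keeps false positives low: a column whose values do not generalize to one pattern simply yields no constraint, rather than a noisy one.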
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
- Making Large Language Models Better Data Creators [22.0882632635255]
Large language models (LLMs) have advanced the state-of-the-art in NLP significantly.
Deploying them for downstream applications is still challenging, however, due to cost, responsiveness, control, or concerns around privacy and security.
We propose a unified data creation pipeline that requires only a single format example.
arXiv Detail & Related papers (2023-10-31T01:08:34Z)
- Better Practices for Domain Adaptation [62.70267990659201]
Domain adaptation (DA) aims to provide frameworks for adapting models to deployment data without using labels.
The lack of a clear validation protocol for DA has led to bad practices in the literature.
We show challenges across all three branches of domain adaptation methodology.
arXiv Detail & Related papers (2023-09-07T17:44:18Z)
- DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data [12.416345241511781]
We propose DiffPrep to automatically and efficiently search for a data preprocessing pipeline for a given dataset.
Our experiments show that DiffPrep achieves the best test accuracy on 15 out of the 18 real-world datasets evaluated.
arXiv Detail & Related papers (2023-08-20T23:40:26Z)
- Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.724842920942024]
Industries such as finance, meteorology, and energy generate vast amounts of data daily.
We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z)
- Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines [41.39496264168388]
Data pipelines are widely employed in modern enterprises to power a variety of Machine-Learning (ML) and Business-Intelligence (BI) applications.
Data quality (DQ) issues can often creep into recurring pipelines because of upstream schema and data drift over time.
We propose Auto-Validate by-History (AVH), which can automatically detect DQ issues in recurring pipelines.
arXiv Detail & Related papers (2023-06-04T17:53:30Z)
- AI Total: Analyzing Security ML Models with Imperfect Data in Production [2.629585075202626]
Development of new machine learning models is typically done on manually curated data sets.
We develop a web-based visualization system that allows the users to quickly gather headline performance numbers.
It also enables the users to immediately observe the root cause of an issue when something goes wrong.
arXiv Detail & Related papers (2021-10-13T20:56:05Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG)
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and our results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information shown and is not responsible for any consequences of its use.