Solving Data Quality Problems with Desbordante: a Demo
- URL: http://arxiv.org/abs/2307.14935v2
- Date: Fri, 28 Jul 2023 11:02:25 GMT
- Authors: George Chernishev, Michael Polyntsov, Anton Chizhov, Kirill Stupakov,
Ilya Shchuckin, Alexander Smirnov, Maxim Strutovsky, Alexey Shlyonskikh,
Mikhail Firsov, Stepan Manannikov, Nikita Bobrov, Daniil Goncharov, Ilia
Barutkin, Vladislav Shalnev, Kirill Muraviev, Anna Rakhmukova, Dmitriy
Shcheka, Anton Chernikov, Mikhail Vyrodov, Yaroslav Kurbatov, Maxim Fofanov,
Sergei Belokonnyi, Pavel Anosov, Arthur Saliou, Eduard Gaisin, Kirill Smirnov
- Abstract summary: Desbordante is an open-source data profiler that aims to close this gap.
It is built with emphasis on industrial application: it is efficient, scalable, resilient to crashes, and provides explanations.
In this demonstration, we show several scenarios that allow end users to solve different data quality problems.
- Score: 35.75243108496634
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Data profiling is an essential process in modern data-driven industries. One
of its critical components is the discovery and validation of complex
statistics, including functional dependencies, data constraints, association
rules, and others.
However, most existing data profiling systems that focus on complex
statistics do not provide proper integration with the tools used by
contemporary data scientists. This creates a significant barrier to the
adoption of these tools in the industry. Moreover, existing systems were not
created with industrial-grade workloads in mind. Finally, they do not aim to
provide descriptive explanations, i.e., why a given pattern is not found. This is
a significant issue, as understanding the underlying reasons for a specific
pattern's absence is essential for making informed decisions based on the data.
Because of that, these patterns are effectively left hanging in thin air: their
application scope is rather limited, and they are rarely used by the broader
public. At the same time, as we are going to demonstrate in this presentation,
complex statistics can be efficiently used to solve many classic data quality
problems.
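To make the abstract's central notion concrete: a functional dependency X -> Y holds in a table when any two rows that agree on X also agree on Y, so validating one candidate dependency reduces to a grouping check. A minimal plain-Python sketch (the table and column names are invented for illustration; this is not Desbordante's API):

```python
from collections import defaultdict

def fd_holds(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds:
    every distinct lhs value must map to exactly one rhs value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return all(len(values) == 1 for values in seen.values())

# Toy table: the zip code should determine the city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "60601", "city": "Chicago"},
]
print(fd_holds(rows, "zip", "city"))  # True: zip -> city holds

rows.append({"zip": "60601", "city": "Chicgo"})  # a typo breaks the FD
print(fd_holds(rows, "zip", "city"))  # False
```

A profiler's discovery task is the harder inverse problem, searching the space of candidate dependencies rather than validating a given one, which is why an efficient core matters.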
Desbordante is an open-source data profiler that aims to close this gap. It
is built with emphasis on industrial application: it is efficient, scalable,
resilient to crashes, and provides explanations. Furthermore, it provides
seamless Python integration by offloading various costly operations, not only
the mining itself, to the C++ core.
In this demonstration, we show several scenarios that allow end users to
solve different data quality problems. Namely, we showcase typo detection, data
deduplication, and data anomaly detection scenarios.
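The typo-detection scenario above can be sketched conceptually: when a functional dependency almost holds, the rare right-hand-side values inside the few violating groups are likely typos. The following plain-Python sketch illustrates that idea only; it is an assumption about the general approach, not Desbordante's actual interface, and the threshold and column names are invented:

```python
from collections import Counter, defaultdict

def suspected_typos(rows, lhs, rhs, max_share=0.25):
    """For an almost-holding FD lhs -> rhs, flag rare rhs values
    inside groups that agree on lhs but disagree on rhs."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[lhs]].append(row[rhs])
    typos = []
    for key, values in groups.items():
        counts = Counter(values)
        if len(counts) < 2:
            continue  # the FD holds within this group
        for value, n in counts.items():
            if n / len(values) <= max_share:  # rare variant: likely a typo
                typos.append((key, value))
    return typos

rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New Yrok"},  # likely typo
    {"zip": "60601", "city": "Chicago"},
]
print(suspected_typos(rows, "zip", "city"))  # [('10001', 'New Yrok')]
```

The same violating-group view supports the deduplication and anomaly-detection scenarios: near-duplicate rows and outlying values both surface as small clusters that prevent an otherwise-holding dependency from being exact.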