Desbordante: from benchmarking suite to high-performance
science-intensive data profiler (preprint)
- URL: http://arxiv.org/abs/2301.05965v1
- Date: Sat, 14 Jan 2023 19:14:51 GMT
- Title: Desbordante: from benchmarking suite to high-performance
science-intensive data profiler (preprint)
- Authors: George Chernishev, Michael Polyntsov, Anton Chizhov, Kirill Stupakov,
Ilya Shchuckin, Alexander Smirnov, Maxim Strutovsky, Alexey Shlyonskikh,
Mikhail Firsov, Stepan Manannikov, Nikita Bobrov, Daniil Goncharov, Ilia
Barutkin, Vladislav Shalnev, Kirill Muraviev, Anna Rakhmukova, Dmitriy
Shcheka, Anton Chernikov, Dmitrii Mandelshtam, Mikhail Vyrodov, Arthur
Saliou, Eduard Gaisin, Kirill Smirnov
- Abstract summary: Desbordante is a high-performance science-intensive data profiler with open source code.
Unlike similar systems, it is built with emphasis on industrial application in a multi-user environment.
It is efficient, resilient to crashes, and scalable.
- Score: 36.537985747809245
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Pioneering data profiling systems such as Metanome and OpenClean brought
public attention to science-intensive data profiling. This type of profiling
aims to extract complex patterns (primitives) such as functional dependencies,
data constraints, association rules, and others. However, these tools are
research prototypes rather than production-ready systems.
The following work presents Desbordante - a high-performance
science-intensive data profiler with open source code. Unlike similar systems,
it is built with emphasis on industrial application in a multi-user
environment. It is efficient, resilient to crashes, and scalable. Its
efficiency is ensured by implementing discovery algorithms in C++, resilience
is achieved by extensive use of containerization, and scalability is based on
replication of containers.
Desbordante aims to open industrial-grade primitive discovery to a broader
public, focusing on domain experts who are not IT professionals. Aside from the
discovery of various primitives, Desbordante offers primitive validation, which
not only reports whether a given instance of a primitive holds, but also
points out what prevents it from holding via special screens. Next,
Desbordante supports pipelines - ready-to-use functionality implemented using
the discovered primitives, for example, typo detection. We provide built-in
pipelines, and users can construct their own via the provided Python bindings.
Unlike other profilers, Desbordante works not only with tabular data, but with
graph and transactional data as well.
In this paper, we present Desbordante, the vision behind it and its
use-cases. To provide a more in-depth perspective, we discuss its current
state, architecture, and design decisions it is built on. Additionally, we
outline our future plans.
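The primitive validation described above can be illustrated with a minimal, tool-agnostic sketch (this is not Desbordante's actual API; `validate_fd` and the sample rows are hypothetical): it checks whether a functional dependency zip → city holds on a table and, if it does not, reports the clusters that violate it, which is the same kind of information a typo-detection pipeline builds on.

```python
from collections import defaultdict

def validate_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds on a list
    of row dicts; return (holds, violations), where violations maps each
    lhs value combination to the conflicting rhs value combinations."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in lhs)
        groups[key].add(tuple(row[col] for col in rhs))
    violations = {k: v for k, v in groups.items() if len(v) > 1}
    return len(violations) == 0, violations

# Hypothetical sample data: a typo in "city" breaks zip -> city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New Yrok"},
    {"zip": "94105", "city": "San Francisco"},
]
holds, violations = validate_fd(rows, ["zip"], ["city"])
# holds is False; violations pinpoints zip "10001" with two conflicting cities
```

A production profiler like Desbordante implements such checks in C++ for efficiency, but the per-cluster violation report is exactly what lets a validation screen explain to a domain expert why a dependency fails.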
Related papers
- When in Doubt, Cascade: Towards Building Efficient and Capable Guardrails [19.80434777786657]
We develop a synthetic pipeline to generate targeted and labeled data.
We show that our method achieves competitive performance with a fraction of the cost in compute.
arXiv Detail & Related papers (2024-07-08T18:39:06Z)
- Implicitly Guided Design with PropEn: Match your Data to Follow the Gradient [52.2669490431145]
PropEn is inspired by 'matching', which enables implicit guidance without training a discriminator.
We show that training with a matched dataset approximates the gradient of the property of interest while remaining within the data distribution.
arXiv Detail & Related papers (2024-05-28T11:30:19Z)
- Solving Data Quality Problems with Desbordante: a Demo [35.75243108496634]
Desbordante is an open-source data profiler that aims to close this gap.
It is built with emphasis on industrial application: it is efficient, scalable, resilient to crashes, and provides explanations.
In this demonstration, we show several scenarios that allow end users to solve different data quality problems.
arXiv Detail & Related papers (2023-07-27T15:26:26Z)
- Fingerprinting and Building Large Reproducible Datasets [3.2873782624127843]
We propose a tool-supported approach facilitating the creation of large tailored datasets while ensuring their provenance.
We propose a way to define a unique fingerprint to characterize a dataset which, when provided to the extraction process, ensures that the same dataset will be extracted.
arXiv Detail & Related papers (2023-06-20T08:59:33Z)
- Going beyond research datasets: Novel intent discovery in the industry setting [60.90117614762879]
This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform.
We show the benefit of pre-training language models on in-domain data: both self-supervised and with weak supervision.
We also devise the best method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks, which we call Conv.
arXiv Detail & Related papers (2023-05-09T14:21:29Z)
- A Unified Active Learning Framework for Annotating Graph Data with Application to Software Source Code Performance Prediction [4.572330678291241]
We develop a unified active learning framework specializing in software performance prediction.
We investigate the impact of using different levels of information for active and passive learning.
Our approach aims to improve the investment in AI models for different software performance predictions.
arXiv Detail & Related papers (2023-04-06T14:00:48Z)
- Position Paper on Dataset Engineering to Accelerate Science [1.952708415083428]
In this work, we will use the token 'dataset' to designate a structured set of data built to perform a well-defined task.
Specifically, in science, each area has unique forms to organize, gather and handle its datasets.
We advocate that science and engineering discovery processes are extreme instances of the need for such organization on datasets.
arXiv Detail & Related papers (2023-03-09T19:07:40Z)
- Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval [60.457378374671656]
Tevatron is a dense retrieval toolkit optimized for efficiency, flexibility, and code simplicity.
We show how Tevatron's flexible design enables easy generalization across datasets, model architectures, and accelerator platforms.
arXiv Detail & Related papers (2022-03-11T05:47:45Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Fully Convolutional Networks for Panoptic Segmentation [91.84686839549488]
We present a conceptually simple, strong, and efficient framework for panoptic segmentation, called Panoptic FCN.
Our approach aims to represent and predict foreground things and background stuff in a unified fully convolutional pipeline.
Panoptic FCN encodes each object instance or stuff category into a specific kernel weight with the proposed kernel generator.
arXiv Detail & Related papers (2020-12-01T18:31:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.