Learning to be a Statistician: Learned Estimator for Number of Distinct
Values
- URL: http://arxiv.org/abs/2202.02800v1
- Date: Sun, 6 Feb 2022 15:42:04 GMT
- Title: Learning to be a Statistician: Learned Estimator for Number of Distinct
Values
- Authors: Renzhi Wu, Bolin Ding, Xu Chu, Zhewei Wei, Xiening Dai, Tao Guan,
Jingren Zhou
- Abstract summary: Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
- Score: 54.629042119819744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Estimating the number of distinct values (NDV) in a column is useful for many
tasks in database systems, such as columnstore compression and data profiling.
In this work, we focus on how to derive accurate NDV estimations from random
(online/offline) samples. Such efficient estimation is critical for tasks where
it is prohibitive to scan the data even once. Existing sample-based estimators
typically rely on heuristics or assumptions and do not have robust performance
across different datasets as the assumptions on data can easily break. On the
other hand, deriving an estimator from a principled formulation such as maximum
likelihood estimation is very challenging due to the complex structure of the
formulation. We propose to formulate the NDV estimation task in a supervised
learning framework, and aim to learn a model as the estimator. To this end, we
need to answer several questions: i) how to make the learned model workload
agnostic; ii) how to obtain training data; iii) how to perform model training.
We derive conditions of the learning framework under which the learned model is
workload agnostic, in the sense that the model/estimator can be trained with
synthetically generated training data, and then deployed into any data
warehouse simply as, e.g., user-defined functions (UDFs), to offer efficient
(within microseconds on CPU) and accurate NDV estimations for unseen tables and
workloads. We compare the learned estimator with the state-of-the-art
sample-based estimators on nine real-world datasets to demonstrate its superior
estimation accuracy. We publish our code for training data generation, model
training, and the learned estimator online for reproducibility.
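To make the pipeline concrete, below is a minimal sketch of the kind of workflow the abstract describes: synthesize training columns, featurize each random sample by its frequency profile, train a small regressor once on synthetic data, and then reuse it as a lightweight estimator (e.g., inside a UDF). This is not the authors' released implementation; the Dirichlet column generator, the truncated profile features, the log-ratio target, and all names here (`sample_profile`, `synthetic_instance`, `estimate_ndv`) are illustrative assumptions.

```python
# Minimal sketch of a supervised NDV-estimation pipeline (illustrative only).
import numpy as np
from collections import Counter
from sklearn.neural_network import MLPRegressor

RNG = np.random.default_rng(0)
PROFILE_LEN = 20  # keep f_1..f_20; higher frequencies fold into the last bin

def sample_profile(sample, population_size):
    """Featurize a sample: log sizes plus the truncated frequency profile,
    where f_j = number of distinct values occurring exactly j times."""
    freq_of_freq = Counter(Counter(sample).values())  # j -> f_j
    f = np.zeros(PROFILE_LEN)
    for j, fj in freq_of_freq.items():
        f[min(j, PROFILE_LEN) - 1] += fj
    n = len(sample)
    # log-scale features keep inputs comparable across table/sample sizes
    return np.concatenate(([np.log(n), np.log(population_size / n)],
                           np.log1p(f)))

def synthetic_instance():
    """Draw one (features, target) pair from a random synthetic column."""
    N = int(RNG.integers(10_000, 100_000))  # population (column) size
    D = int(RNG.integers(10, 5_000))        # ground-truth NDV
    # Dirichlet concentration controls skew, so training covers many shapes
    probs = RNG.dirichlet(np.full(D, RNG.uniform(0.05, 2.0)))
    n = int(N * RNG.uniform(0.005, 0.05))   # random sample of the column
    sample = RNG.choice(D, size=n, p=probs)
    d = len(np.unique(sample))              # distinct values in the sample
    return sample_profile(sample, N), np.log(D / d)

# Train once on purely synthetic data ...
X, y = map(np.array, zip(*(synthetic_instance() for _ in range(5000))))
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=300,
                     random_state=0).fit(X, y)

# ... then deploy: estimation is a single forward pass (UDF-friendly).
def estimate_ndv(sample, population_size):
    x = sample_profile(sample, population_size).reshape(1, -1)
    return len(set(sample)) * np.exp(model.predict(x)[0])
```

Predicting the log ratio between the true NDV and the sample's observed NDV keeps the target scale-free, which is one plausible way to realize the workload-agnostic property: a model trained purely on synthetic columns can then be applied to unseen tables of very different sizes.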
Related papers
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z) - A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation [17.351089059392674]
We propose a framework for model evaluation that includes stratification, sampling, and estimation components.
We show that stratification via k-means clustering based on accurate predictions of model performance yields efficient estimators.
We also find that model-assisted estimators, which leverage predictions of model accuracy on the unlabeled portion of the dataset, are generally more efficient than traditional estimators.
arXiv Detail & Related papers (2024-06-11T14:49:04Z) - Is Data Valuation Learnable and Interpretable? [3.9325957466009203]
Current data valuation methods ignore the interpretability of the output values.
This study aims to answer an important question: is data valuation learnable and interpretable?
arXiv Detail & Related papers (2024-06-03T08:13:47Z) - Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value [17.340091573913316]
We propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate.
Data-OOB takes less than 2.25 hours on a single CPU processor when there are $10^6$ samples to evaluate and the input dimension is 100.
We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points.
arXiv Detail & Related papers (2023-04-16T08:03:58Z) - TRAK: Attributing Model Behavior at Scale [79.56020040993947]
We present TRAK (Tracing with the Randomly-Projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differentiable models.
arXiv Detail & Related papers (2023-03-24T17:56:22Z) - A Meta-Learning Approach to Predicting Performance and Data Requirements [163.4412093478316]
We propose an approach to estimate the number of samples required for a model to reach a target performance.
We find that the power law, the de facto principle for estimating model performance, leads to large errors in the small-data regime.
We introduce a novel piecewise power law (PPL) that handles the two data regimes differently.
arXiv Detail & Related papers (2023-03-02T21:48:22Z) - How to Train an Accurate and Efficient Object Detection Model on Any
Dataset [0.0]
We propose a dataset-agnostic template for object detection trainings.
It consists of carefully chosen and pre-trained models together with a robust training pipeline for further training.
Our solution works out-of-the-box and provides a strong baseline on a wide range of datasets.
arXiv Detail & Related papers (2022-11-30T17:09:01Z) - Conformal prediction for the design problem [72.14982816083297]
In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next.
In such settings, there is a distinct type of distribution shift between the training and test data.
We introduce a method to quantify predictive uncertainty in such settings.
arXiv Detail & Related papers (2022-02-08T02:59:12Z) - ALT-MAS: A Data-Efficient Framework for Active Testing of Machine
Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z) - Neural Approximate Sufficient Statistics for Implicit Models [34.44047460667847]
We frame the task of constructing sufficient statistics as learning mutual-information-maximizing representations of the data with the help of deep neural networks.
We apply our approach to both traditional approximate Bayesian computation and recent neural likelihood methods, boosting their performance on a range of tasks.
arXiv Detail & Related papers (2020-10-20T07:11:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.