Conformalized Frequency Estimation from Sketched Data
- URL: http://arxiv.org/abs/2204.04270v1
- Date: Fri, 8 Apr 2022 19:39:37 GMT
- Title: Conformalized Frequency Estimation from Sketched Data
- Authors: Matteo Sesia and Stefano Favaro
- Abstract summary: A flexible conformal inference method is developed to construct confidence intervals for the frequencies of objects queried in a very large data set.
The approach is completely data-adaptive and makes no use of any knowledge of the population distribution or of the inner workings of the sketching algorithm.
- Score: 6.510507449705344
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: A flexible conformal inference method is developed to construct confidence
intervals for the frequencies of queried objects in a very large data set,
based on the information contained in a much smaller sketch of those data. The
approach is completely data-adaptive and makes no use of any knowledge of the
population distribution or of the inner workings of the sketching algorithm;
instead, it constructs provably valid frequentist confidence intervals under
the sole assumption of data exchangeability. Although the proposed solution is
much more broadly applicable, this paper explicitly demonstrates its use in
combination with the famous count-min sketch algorithm and a non-linear
variation thereof to facilitate the exposition. The performance is compared to
that of existing frequentist and Bayesian alternatives through several
experiments with synthetic data as well as with real data sets consisting of
SARS-CoV-2 DNA sequences and classic English literature.
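The idea in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' procedure: it assumes a plain split-conformal calibration in which a held-out set of items, with exact counts in the sketched data, is used to estimate how much a count-min sketch over-counts. The class `CountMinSketch`, the function `conformal_interval`, the salted-hash scheme, and all parameter values are hypothetical choices made for illustration only.

```python
import math
import random
from collections import Counter


class CountMinSketch:
    """Toy count-min sketch over hashable items (illustrative, not optimized)."""

    def __init__(self, width, depth, seed=0):
        self.width = width
        self.depth = depth
        rng = random.Random(seed)
        # One random salt per row stands in for an independent hash function.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        return hash((self.salts[row], item)) % self.width

    def update(self, item):
        for r in range(self.depth):
            self.table[r][self._index(r, item)] += 1

    def query(self, item):
        # Collisions only add counts, so the minimum over rows never
        # falls below the true frequency.
        return min(self.table[r][self._index(r, item)] for r in range(self.depth))


def conformal_interval(cms, calib_items, true_counts, query, alpha=0.1):
    """Return (lower, upper) bounds for the frequency of `query` in the sketched data.

    Assumption: calibration items and the query are exchangeable, and
    true_counts holds exact frequencies within the sketched data.
    """
    # Conformity score: how much the sketch over-counts each calibration item.
    scores = sorted(cms.query(z) - true_counts[z] for z in calib_items)
    n = len(scores)
    k = math.ceil((1 - alpha) * (n + 1))
    q = scores[k - 1] if k <= n else float("inf")
    upper = cms.query(query)
    lower = max(0, upper - q)
    return lower, upper


# Synthetic heavy-tailed data: sketch the first part, calibrate on the rest.
rng = random.Random(0)
data = [int(rng.paretovariate(1.2)) for _ in range(5000)]
sketched, calib = data[:4800], data[4800:]

cms = CountMinSketch(width=50, depth=3)
for z in sketched:
    cms.update(z)
true_counts = Counter(sketched)  # exact counts, needed only for calibration

query = int(rng.paretovariate(1.2))  # a fresh draw from the same distribution
lower, upper = conformal_interval(cms, calib, true_counts, query, alpha=0.1)
print(f"query={query}: frequency in [{lower}, {upper}]")
```

The upper bound here is the deterministic count-min estimate, and only the lower bound is calibrated. A very small `alpha` relative to the calibration size makes the quantile infinite, so the lower bound falls back to 0.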
Related papers
- Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
arXiv Detail & Related papers (2024-06-25T10:49:56Z)
- Anomalous Change Point Detection Using Probabilistic Predictive Coding [13.719066883151623]
We propose a deep learning-based CPD/AD method called Probabilistic Predictive Coding (PPC)
PPC jointly learns to encode sequential data to low dimensional latent space representations and to predict the subsequent data representations as well as the corresponding prediction uncertainties.
We demonstrate the effectiveness and adaptability of our proposed method across synthetic time series experiments, image data, and real-world magnetic resonance spectroscopic imaging data.
arXiv Detail & Related papers (2024-05-24T17:17:34Z)
- Inference With Combining Rules From Multiple Differentially Private Synthetic Datasets [0.0]
We study the applicability of procedures based on combining rules to the analysis of DIPS datasets.
Our empirical experiments show that the proposed combining rules may offer accurate inference in certain contexts, but not in all cases.
arXiv Detail & Related papers (2024-05-08T02:33:35Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Conformal Frequency Estimation using Discrete Sketched Data with Coverage for Distinct Queries [35.67445122503686]
This paper develops conformal inference methods to construct a confidence interval for the frequency of a queried object in a very large discrete data set.
We show our methods have improved empirical performance compared to existing frequentist and Bayesian alternatives in simulations.
arXiv Detail & Related papers (2022-11-09T00:05:29Z)
- Toward Learning Robust and Invariant Representations with Alignment Regularization and Data Augmentation [76.85274970052762]
This paper is motivated by a proliferation of options of alignment regularizations.
We evaluate the performances of several popular design choices along the dimensions of robustness and invariance.
We also formally analyze the behavior of alignment regularization to complement our empirical study under assumptions we consider realistic.
arXiv Detail & Related papers (2022-06-04T04:29:19Z)
- The Optimal Noise in Noise-Contrastive Learning Is Not What You Think [80.07065346699005]
We show that deviating from this assumption can actually lead to better statistical estimators.
In particular, the optimal noise distribution is different from the data's and even from a different family.
arXiv Detail & Related papers (2022-03-02T13:59:20Z)
- Approximate Bayesian Computation with Path Signatures [0.5156484100374059]
We introduce the use of path signatures as a natural candidate feature set for constructing distances between time series data.
Our experiments show that such an approach can generate more accurate approximate Bayesian posteriors than existing techniques for time series models.
arXiv Detail & Related papers (2021-06-23T17:25:43Z)
- PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Driven Adaptive Prior [103.00403682863427]
We propose PriorGrad to improve the efficiency of the conditional diffusion model.
We show that PriorGrad achieves a faster convergence leading to data and parameter efficiency and improved quality.
arXiv Detail & Related papers (2021-06-11T14:04:03Z)
- Towards Synthetic Multivariate Time Series Generation for Flare Forecasting [5.098461305284216]
One of the limiting factors in training data-driven, rare-event prediction algorithms is the scarcity of the events of interest.
In this study, we explore the usefulness of the conditional generative adversarial network (CGAN) as a means to perform data-informed oversampling.
arXiv Detail & Related papers (2021-05-16T22:23:23Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.