Conformal Frequency Estimation using Discrete Sketched Data with
Coverage for Distinct Queries
- URL: http://arxiv.org/abs/2211.04612v2
- Date: Tue, 15 Aug 2023 23:27:12 GMT
- Title: Conformal Frequency Estimation using Discrete Sketched Data with
Coverage for Distinct Queries
- Authors: Matteo Sesia, Stefano Favaro, Edgar Dobriban
- Abstract summary: This paper develops conformal inference methods to construct a confidence interval for the frequency of a queried object in a very large discrete data set.
We show our methods have improved empirical performance compared to existing frequentist and Bayesian alternatives in simulations.
- Score: 35.67445122503686
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper develops conformal inference methods to construct a confidence
interval for the frequency of a queried object in a very large discrete data
set, based on a sketch with a lower memory footprint. This approach requires no
knowledge of the data distribution and can be combined with any sketching
algorithm, including but not limited to the renowned count-min sketch, the
count-sketch, and variations thereof. After explaining how to achieve marginal
coverage for exchangeable random queries, we extend our solution to provide
stronger inferences that can account for the discreteness of the data and for
heterogeneous query frequencies, increasing also robustness to possible
distribution shifts. These results are facilitated by a novel conformal
calibration technique that guarantees valid coverage for a large fraction of
distinct random queries. Finally, we show our methods have improved empirical
performance compared to existing frequentist and Bayesian alternatives in
simulations as well as in examples of text and SARS-CoV-2 DNA data.
Related papers
- Distributed Markov Chain Monte Carlo Sampling based on the Alternating
Direction Method of Multipliers [143.6249073384419]
In this paper, we propose a distributed sampling scheme based on the alternating direction method of multipliers.
We provide both theoretical guarantees of our algorithm's convergence and experimental evidence of its superiority to the state-of-the-art.
In simulation, we deploy our algorithm on linear and logistic regression tasks and illustrate its fast convergence compared to existing gradient-based methods.
arXiv Detail & Related papers (2024-01-29T02:08:40Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - Derandomized Novelty Detection with FDR Control via Conformal E-values [20.864605211132663]
We propose to make conformal inferences more stable by leveraging suitable conformal e-values instead of p-values.
We show that the proposed method can reduce randomness without much loss of power compared to standard conformal inference.
arXiv Detail & Related papers (2023-02-14T19:21:44Z) - Differentially Private Federated Clustering over Non-IID Data [59.611244450530315]
clustering clusters (FedC) problem aims to accurately partition unlabeled data samples distributed over massive clients into finite clients under the orchestration of a server.
We propose a novel FedC algorithm using differential privacy convergence technique, referred to as DP-Fed, in which partial participation and multiple clients are also considered.
Various attributes of the proposed DP-Fed are obtained through theoretical analyses of privacy protection, especially for the case of non-identically and independently distributed (non-i.i.d.) data.
arXiv Detail & Related papers (2023-01-03T05:38:43Z) - Conformalized Frequency Estimation from Sketched Data [6.510507449705344]
A flexible conformal inference method is developed to construct confidence intervals for the frequencies of objects queried in a very large data set.
The approach is completely data-adaptive and makes no use of any knowledge of the population distribution or of the inner workings of the sketching algorithm.
arXiv Detail & Related papers (2022-04-08T19:39:37Z) - Robust and Provable Guarantees for Sparse Random Embeddings [72.24615341588846]
We improve upon the guarantees for sparse random embeddings provided by Freksen at al. (NIPS'18) and Jagadeesan (NIPS'18)
We show that (a) our bounds are explicit as opposed to the guarantees provided previously, and (b) our bounds are guaranteed to be sharper by practically significant constants.
We empirically demonstrate that our bounds significantly outperform prior works on a wide range of real-world datasets.
arXiv Detail & Related papers (2022-02-22T11:15:59Z) - On Sparse High-Dimensional Graphical Model Learning For Dependent Time Series [12.94486861344922]
We consider the problem of inferring the conditional independence graph (CIG) of a sparse, high-dimensional stationary time series.
A sparse-group lasso-based frequency-domain formulation of the problem is presented.
We also empirically investigate selection of the tuning parameters based on Bayesian information criterion.
arXiv Detail & Related papers (2021-11-15T16:52:02Z) - Generalization in the Face of Adaptivity: A Bayesian Perspective [3.0202264016476623]
Repeated use of a data sample via adaptively chosen queries can rapidly lead to overfitting.
It turns out that simple noise unbounded addition algorithms suffice to prevent this issue.
We show that the harm of adaptivity results from the covariance between the new query and a Bayes factor-based measure of how much information about the data sample was encoded in the responses given to past queries.
arXiv Detail & Related papers (2021-06-20T22:06:44Z) - Improved, Deterministic Smoothing for L1 Certified Robustness [119.86676998327864]
We propose a non-additive and deterministic smoothing method, Deterministic Smoothing with Splitting Noise (DSSN)
In contrast to uniform additive smoothing, the SSN certification does not require the random noise components used to be independent.
This is the first work to provide deterministic "randomized smoothing" for a norm-based adversarial threat model.
arXiv Detail & Related papers (2021-03-17T21:49:53Z) - Beyond the Mean-Field: Structured Deep Gaussian Processes Improve the
Predictive Uncertainties [12.068153197381575]
We propose a novel variational family that allows for retaining covariances between latent processes while achieving fast convergence.
We provide an efficient implementation of our new approach and apply it to several benchmark datasets.
It yields excellent results and strikes a better balance between accuracy and calibrated uncertainty estimates than its state-of-the-art alternatives.
arXiv Detail & Related papers (2020-05-22T11:10:59Z) - Spatially Adaptive Inference with Stochastic Feature Sampling and
Interpolation [72.40827239394565]
We propose to compute features only at sparsely sampled locations.
We then densely reconstruct the feature map with an efficient procedure.
The presented network is experimentally shown to save substantial computation while maintaining accuracy over a variety of computer vision tasks.
arXiv Detail & Related papers (2020-03-19T15:36:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.