Subsampling Suffices for Adaptive Data Analysis
- URL: http://arxiv.org/abs/2302.08661v2
- Date: Wed, 20 Sep 2023 20:37:55 GMT
- Title: Subsampling Suffices for Adaptive Data Analysis
- Authors: Guy Blanc
- Abstract summary: Most classical techniques assume that the dataset is independent of the analyst's query and break down in the common setting where a dataset is reused for multiple, adaptively chosen, queries.
We identify a remarkably simple set of assumptions under which the queries will continue to be representative even when chosen adaptively.
The simplicity of this subsampling-based framework allows it to model a variety of real-world scenarios not covered by prior work.
- Score: 8.231050911072755
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring that analyses performed on a dataset are representative of the
entire population is one of the central problems in statistics. Most classical
techniques assume that the dataset is independent of the analyst's query and
break down in the common setting where a dataset is reused for multiple,
adaptively chosen, queries. This problem of \emph{adaptive data analysis} was
formalized in the seminal works of Dwork et al. (STOC, 2015) and Hardt and
Ullman (FOCS, 2014).
We identify a remarkably simple set of assumptions under which the queries
will continue to be representative even when chosen adaptively: The only
requirements are that each query takes as input a random subsample and outputs
few bits. This result shows that the noise inherent in subsampling is
sufficient to guarantee that query responses generalize. The simplicity of this
subsampling-based framework allows it to model a variety of real-world
scenarios not covered by prior work.
In addition to its simplicity, we demonstrate the utility of this framework
by designing mechanisms for two foundational tasks, statistical queries and
median finding. In particular, our mechanism for answering the broadly
applicable class of statistical queries is both extremely simple and state of
the art in many parameter regimes.
Related papers
- MeaeQ: Mount Model Extraction Attacks with Efficient Queries [6.1106195466129485]
We study model extraction attacks in natural language processing (NLP)
We propose MeaeQ, a straightforward yet effective method to address these issues.
MeaeQ achieves higher functional similarity to the victim model than baselines while requiring fewer queries.
arXiv Detail & Related papers (2023-10-21T16:07:16Z) - Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs)
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias"
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z) - A Statistical View of Column Subset Selection [91.3755431537592]
We consider the problem of selecting a small subset of representative variables from a large dataset.
We show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
arXiv Detail & Related papers (2023-07-24T15:42:33Z) - Adaptive Data Analysis in a Balanced Adversarial Model [26.58630744414181]
In adaptive data analysis, a mechanism gets $n$ i.i.d. samples from an unknown distribution $D$, and is required to provide accurate estimations.
We consider more restricted adversaries, called emphbalanced, where each such adversary consists of two separated algorithms.
We show that these stronger hardness assumptions are unavoidable in the sense that any computationally bounded emphbalanced adversary implies the existence of public-key cryptography.
arXiv Detail & Related papers (2023-05-24T15:08:05Z) - MQAG: Multiple-choice Question Answering and Generation for Assessing
Information Consistency in Summarization [55.60306377044225]
State-of-the-art summarization systems can generate highly fluent summaries.
These summaries, however, may contain factual inconsistencies and/or information not present in the source.
We introduce an alternative scheme based on standard information-theoretic measures in which the information present in the source and summary is directly compared.
arXiv Detail & Related papers (2023-01-28T23:08:25Z) - Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z) - Beyond Accuracy: A Consolidated Tool for Visual Question Answering
Benchmarking [30.155625852894797]
We propose a browser-based benchmarking tool for researchers and challenge organizers.
Our tool helps test generalization capabilities of models across multiple datasets.
Interactive filtering facilitates discovery of problematic behavior.
arXiv Detail & Related papers (2021-10-11T11:08:35Z) - Generalization in the Face of Adaptivity: A Bayesian Perspective [3.0202264016476623]
Repeated use of a data sample via adaptively chosen queries can rapidly lead to overfitting.
It turns out that simple noise unbounded addition algorithms suffice to prevent this issue.
We show that the harm of adaptivity results from the covariance between the new query and a Bayes factor-based measure of how much information about the data sample was encoded in the responses given to past queries.
arXiv Detail & Related papers (2021-06-20T22:06:44Z) - Approximate Query Processing for Group-By Queries based on Conditional
Generative Models [3.9837198605506963]
Group-by query involves multiple values, which makes it difficult to provide sufficiently accurate estimations for all the groups.
Stratified sampling improves the accuracy compared with the uniform sampling, but the samples chosen for some special queries cannot work for other queries.
Online sampling chooses samples for the given query at query time, but it requires a long latency.
The proposed framework can be combined with stratified sampling and online aggregation to improve the estimation accuracy for group-by queries.
arXiv Detail & Related papers (2021-01-08T08:49:21Z) - Abstractive Query Focused Summarization with Query-Free Resources [60.468323530248945]
In this work, we consider the problem of leveraging only generic summarization resources to build an abstractive QFS system.
We propose Marge, a Masked ROUGE Regression framework composed of a novel unified representation for summaries and queries.
Despite learning from minimal supervision, our system achieves state-of-the-art results in the distantly supervised setting.
arXiv Detail & Related papers (2020-12-29T14:39:35Z) - One for More: Selecting Generalizable Samples for Generalizable ReID
Model [92.40951770273972]
This paper proposes a one-for-more training objective that takes the generalization ability of selected samples as a loss function.
Our proposed one-for-more based sampler can be seamlessly integrated into the ReID training framework.
arXiv Detail & Related papers (2020-12-10T06:37:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.