Related papers: Subsampling Suffices for Adaptive Data Analysis

Subsampling Suffices for Adaptive Data Analysis

URL: http://arxiv.org/abs/2302.08661v2
Date: Wed, 20 Sep 2023 20:37:55 GMT
Title: Subsampling Suffices for Adaptive Data Analysis
Authors: Guy Blanc
Abstract summary: Most classical techniques assume that the dataset is independent of the analyst's query and break down in the common setting where a dataset is reused for multiple, adaptively chosen, queries. We identify a remarkably simple set of assumptions under which the queries will continue to be representative even when chosen adaptively. The simplicity of this subsampling-based framework allows it to model a variety of real-world scenarios not covered by prior work.
Score: 8.231050911072755
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Ensuring that analyses performed on a dataset are representative of the entire population is one of the central problems in statistics. Most classical techniques assume that the dataset is independent of the analyst's query and break down in the common setting where a dataset is reused for multiple, adaptively chosen, queries. This problem of \emph{adaptive data analysis} was formalized in the seminal works of Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014). We identify a remarkably simple set of assumptions under which the queries will continue to be representative even when chosen adaptively: The only requirements are that each query takes as input a random subsample and outputs few bits. This result shows that the noise inherent in subsampling is sufficient to guarantee that query responses generalize. The simplicity of this subsampling-based framework allows it to model a variety of real-world scenarios not covered by prior work. In addition to its simplicity, we demonstrate the utility of this framework by designing mechanisms for two foundational tasks, statistical queries and median finding. In particular, our mechanism for answering the broadly applicable class of statistical queries is both extremely simple and state of the art in many parameter regimes.

Related papers

Tight Bounds for Answering Adaptively Chosen Concentrated Queries [24.243931410934323]
In this work, we prove that this utility gap is inherent under the current formulation of the concentrated queries framework.<n>We present a simplified version of the best-known algorithms that match our impossibility result.
arXiv Detail & Related papers (2025-07-18T07:08:42Z)
Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity [59.57065228857247]
Retrieval-augmented Large Language Models (LLMs) have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA) We propose a novel adaptive QA framework, that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs based on the query complexity. We validate our model on a set of open-domain QA datasets, covering multiple query complexities, and show that ours enhances the overall efficiency and accuracy of QA systems.
arXiv Detail & Related papers (2024-03-21T13:52:30Z)
MeaeQ: Mount Model Extraction Attacks with Efficient Queries [6.1106195466129485]
We study model extraction attacks in natural language processing (NLP) We propose MeaeQ, a straightforward yet effective method to address these issues. MeaeQ achieves higher functional similarity to the victim model than baselines while requiring fewer queries.
arXiv Detail & Related papers (2023-10-21T16:07:16Z)
Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs) This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias" We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z)
Adaptive Data Analysis in a Balanced Adversarial Model [26.58630744414181]
In adaptive data analysis, a mechanism gets $n$ i.i.d. samples from an unknown distribution $D$, and is required to provide accurate estimations. We consider more restricted adversaries, called emphbalanced, where each such adversary consists of two separated algorithms. We show that these stronger hardness assumptions are unavoidable in the sense that any computationally bounded emphbalanced adversary implies the existence of public-key cryptography.
arXiv Detail & Related papers (2023-05-24T15:08:05Z)
MQAG: Multiple-choice Question Answering and Generation for Assessing Information Consistency in Summarization [55.60306377044225]
State-of-the-art summarization systems can generate highly fluent summaries. These summaries, however, may contain factual inconsistencies and/or information not present in the source. We introduce an alternative scheme based on standard information-theoretic measures in which the information present in the source and summary is directly compared.
arXiv Detail & Related papers (2023-01-28T23:08:25Z)
Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples. Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking [30.155625852894797]
We propose a browser-based benchmarking tool for researchers and challenge organizers. Our tool helps test generalization capabilities of models across multiple datasets. Interactive filtering facilitates discovery of problematic behavior.
arXiv Detail & Related papers (2021-10-11T11:08:35Z)
Generalization in the Face of Adaptivity: A Bayesian Perspective [3.0202264016476623]
Repeated use of a data sample via adaptively chosen queries can rapidly lead to overfitting. It turns out that simple noise unbounded addition algorithms suffice to prevent this issue. We show that the harm of adaptivity results from the covariance between the new query and a Bayes factor-based measure of how much information about the data sample was encoded in the responses given to past queries.
arXiv Detail & Related papers (2021-06-20T22:06:44Z)
Approximate Query Processing for Group-By Queries based on Conditional Generative Models [3.9837198605506963]
Group-by query involves multiple values, which makes it difficult to provide sufficiently accurate estimations for all the groups. Stratified sampling improves the accuracy compared with the uniform sampling, but the samples chosen for some special queries cannot work for other queries. Online sampling chooses samples for the given query at query time, but it requires a long latency. The proposed framework can be combined with stratified sampling and online aggregation to improve the estimation accuracy for group-by queries.
arXiv Detail & Related papers (2021-01-08T08:49:21Z)
Abstractive Query Focused Summarization with Query-Free Resources [60.468323530248945]
In this work, we consider the problem of leveraging only generic summarization resources to build an abstractive QFS system. We propose Marge, a Masked ROUGE Regression framework composed of a novel unified representation for summaries and queries. Despite learning from minimal supervision, our system achieves state-of-the-art results in the distantly supervised setting.
arXiv Detail & Related papers (2020-12-29T14:39:35Z)
One for More: Selecting Generalizable Samples for Generalizable ReID Model [92.40951770273972]
This paper proposes a one-for-more training objective that takes the generalization ability of selected samples as a loss function. Our proposed one-for-more based sampler can be seamlessly integrated into the ReID training framework.
arXiv Detail & Related papers (2020-12-10T06:37:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.