Synthetic-Powered Multiple Testing with FDR Control
- URL: http://arxiv.org/abs/2602.16690v1
- Date: Wed, 18 Feb 2026 18:36:24 GMT
- Title: Synthetic-Powered Multiple Testing with FDR Control
- Authors: Yonghoon Lee, Meshi Bashari, Edgar Dobriban, Yaniv Romano
- Abstract summary: We introduce SynthBH, a synthetic-powered multiple testing procedure that safely leverages synthetic data. We prove that SynthBH guarantees finite-sample, distribution-free FDR control under a mild PRDS-type positive dependence condition. It enhances the sample efficiency and may boost the power when synthetic data are of high quality, while controlling the FDR at a user-specified level regardless of their quality.
- Score: 29.516221063294157
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multiple hypothesis testing with false discovery rate (FDR) control is a fundamental problem in statistical inference, with broad applications in genomics, drug screening, and outlier detection. In many such settings, researchers may have access not only to real experimental observations but also to auxiliary or synthetic data -- from past, related experiments or generated by generative models -- that can provide additional evidence about the hypotheses of interest. We introduce SynthBH, a synthetic-powered multiple testing procedure that safely leverages such synthetic data. We prove that SynthBH guarantees finite-sample, distribution-free FDR control under a mild PRDS-type positive dependence condition, without requiring the pooled-data p-values to be valid under the null. The proposed method adapts to the (unknown) quality of the synthetic data: it enhances the sample efficiency and may boost the power when synthetic data are of high quality, while controlling the FDR at a user-specified level regardless of their quality. We demonstrate the empirical performance of SynthBH on tabular outlier detection benchmarks and on genomic analyses of drug-cancer sensitivity associations, and further study its properties through controlled experiments on simulated data.
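For orientation, the kind of FDR control the abstract refers to can be illustrated with the classical Benjamini-Hochberg (BH) step-up procedure. The sketch below is not SynthBH: the abstract does not specify how SynthBH combines real and synthetic p-values, and the assumption that it builds on BH is inferred only from its name. The function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.1):
    """Classical BH step-up procedure (a baseline sketch, NOT SynthBH).

    Returns a list of booleans, one per hypothesis, marking rejections
    at target FDR level `alpha`.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= alpha * k / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank

    # Reject the k hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, for illustration only.
example_p = [0.041, 0.001, 0.205, 0.008, 0.039, 0.06]
example_rejections = benjamini_hochberg(example_p, alpha=0.05)
```

BH is known to control the FDR under independence and under PRDS-type positive dependence, which is the same family of conditions the abstract cites for SynthBH.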
Related papers
- Cross-Validated Causal Inference: a Modern Method to Combine Experimental and Observational Data [48.72384067821617]
We develop new methods to integrate experimental and observational data in causal inference. A full model containing the causal parameter is obtained by minimizing a weighted combination of experimental and observational losses. Experiments on real and synthetic data show the efficacy and reliability of our method.
arXiv Detail & Related papers (2025-11-01T22:24:16Z) - Statistical Inference Leveraging Synthetic Data with Distribution-Free Guarantees [27.512077526249524]
High-quality synthetic data presents both opportunities and challenges for statistical inference. This paper introduces a GEneral Synthetic-Powered Inference framework that wraps around any statistical inference procedure. Our framework leverages high-quality synthetic data to boost statistical power, yet adaptively defaults to the standard inference method using only real data.
arXiv Detail & Related papers (2025-09-24T17:37:14Z) - Valid Inference with Imperfect Synthetic Data [39.10587411316875]
We introduce a new estimator based on generalized method of moments. We find that interactions between the moment residuals of synthetic data and those of real data can greatly improve estimates of the target parameter.
arXiv Detail & Related papers (2025-08-08T18:32:52Z) - A Sample Efficient Conditional Independence Test in the Presence of Discretization [54.047334792855345]
Applying Conditional Independence (CI) tests directly to discretized data can lead to incorrect conclusions. Recent advancements have sought to infer the correct CI relationship between the latent variables through binarizing observed data. Motivated by this, this paper introduces a sample-efficient CI test that does not rely on the binarization process.
arXiv Detail & Related papers (2025-06-10T12:41:26Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Does Differentially Private Synthetic Data Lead to Synthetic Discoveries? [1.9573380763700712]
The evaluation is conducted in terms of the tests' Type I and Type II errors.
A large portion of the evaluation results expressed dramatically inflated Type I errors, especially at privacy budget levels of $\epsilon \leq 1$.
arXiv Detail & Related papers (2024-03-20T14:03:57Z) - The Real Deal Behind the Artificial Appeal: Inferential Utility of Tabular Synthetic Data [40.165159490379146]
We show that the rate of false-positive findings (type 1 error) will be unacceptably high, even when the estimates are unbiased.
Despite the use of a previously proposed correction factor, this problem persists for deep generative models.
arXiv Detail & Related papers (2023-12-13T02:04:41Z) - Can You Rely on Your Model Evaluation? Improving Model Evaluation with Synthetic Test Data [75.20035991513564]
We introduce 3S Testing, a deep generative modeling framework to facilitate model evaluation.
Our experiments demonstrate that 3S Testing outperforms traditional baselines.
These results raise the question of whether we need a paradigm shift away from limited real test data towards synthetic test data.
arXiv Detail & Related papers (2023-10-25T10:18:44Z) - On Synthetic Data for Back Translation [66.6342561585953]
Back translation (BT) is one of the most significant technologies in NMT research fields.
We identify two key factors on synthetic data controlling the back-translation NMT performance, which are quality and importance.
We propose a simple yet effective method to generate synthetic data to better trade off both factors so as to yield a better performance for BT.
arXiv Detail & Related papers (2023-10-20T17:24:12Z) - Perturbation-Assisted Sample Synthesis: A Novel Approach for Uncertainty Quantification [3.175239447683357]
This paper introduces a novel Perturbation-Assisted Inference (PAI) framework utilizing synthetic data generated by the Perturbation-Assisted Sample Synthesis (PASS) method.
The framework focuses on uncertainty quantification in complex data scenarios, particularly involving unstructured data.
We demonstrate the effectiveness of PAI in advancing uncertainty quantification in complex, data-driven tasks by applying it to diverse areas such as image synthesis, sentiment word analysis, multimodal inference, and the construction of prediction intervals.
arXiv Detail & Related papers (2023-05-30T01:01:36Z) - Epistemic Parity: Reproducibility as an Evaluation Metric for Differential Privacy [9.755020926517291]
We propose an evaluation methodology for synthetic data that avoids assumptions about the representativeness of proxy tasks.
We measure the likelihood that published conclusions would change had the authors used synthetic data.
We advocate for a new class of mechanisms that favor stronger utility guarantees and offer privacy protection.
arXiv Detail & Related papers (2022-08-26T14:57:21Z) - BeCAPTCHA-Type: Biometric Keystroke Data Generation for Improved Bot Detection [63.447493500066045]
This work proposes a data driven learning model for the synthesis of keystroke biometric data.
The proposed method is compared with two statistical approaches based on Universal and User-dependent models.
Our experimental framework considers a dataset with 136 million keystroke events from 168 thousand subjects.
arXiv Detail & Related papers (2022-07-27T09:26:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.