Related papers: Deep anytime-valid hypothesis testing

Deep anytime-valid hypothesis testing

URL: http://arxiv.org/abs/2310.19384v1
Date: Mon, 30 Oct 2023 09:46:19 GMT
Title: Deep anytime-valid hypothesis testing
Authors: Teodora Pandeva and Patrick Forr\'e and Aaditya Ramdas and Shubhanshu Shekhar
Abstract summary: We propose a general framework for constructing powerful, sequential hypothesis tests for nonparametric testing problems. We develop a principled approach of leveraging the representation capability of machine learning models within the testing-by-betting framework. Empirical results on synthetic and real-world datasets demonstrate that tests instantiated using our general framework are competitive against specialized baselines.
Score: 29.273915933729057
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose a general framework for constructing powerful, sequential hypothesis tests for a large class of nonparametric testing problems. The null hypothesis for these problems is defined in an abstract form using the action of two known operators on the data distribution. This abstraction allows for a unified treatment of several classical tasks, such as two-sample testing, independence testing, and conditional-independence testing, as well as modern problems, such as testing for adversarial robustness of machine learning (ML) models. Our proposed framework has the following advantages over classical batch tests: 1) it continuously monitors online data streams and efficiently aggregates evidence against the null, 2) it provides tight control over the type I error without the need for multiple testing correction, 3) it adapts the sample size requirement to the unknown hardness of the problem. We develop a principled approach of leveraging the representation capability of ML models within the testing-by-betting framework, a game-theoretic approach for designing sequential tests. Empirical results on synthetic and real-world datasets demonstrate that tests instantiated using our general framework are competitive against specialized baselines on several tasks.

Related papers

Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures. We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
arXiv Detail & Related papers (2024-10-31T15:06:16Z)
SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists [59.08999823652293]
We propose SYNTHEVAL to generate a wide range of test types for a comprehensive evaluation of NLP models. In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the taskspecific models consistently exhibit. We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks.
arXiv Detail & Related papers (2024-08-30T17:41:30Z)
Precise Error Rates for Computationally Efficient Testing [75.63895690909241]
We revisit the question of simple-versus-simple hypothesis testing with an eye towards computational complexity. An existing test based on linear spectral statistics achieves the best possible tradeoff curve between type I and type II error rates.
arXiv Detail & Related papers (2023-11-01T04:41:16Z)
Can You Rely on Your Model Evaluation? Improving Model Evaluation with Synthetic Test Data [75.20035991513564]
We introduce 3S Testing, a deep generative modeling framework to facilitate model evaluation. Our experiments demonstrate that 3S Testing outperforms traditional baselines. These results raise the question of whether we need a paradigm shift away from limited real test data towards synthetic test data.
arXiv Detail & Related papers (2023-10-25T10:18:44Z)
Validation of massively-parallel adaptive testing using dynamic control matching [0.0]
Modern businesses often run many A/B/n tests at the same time and in parallel, and package many content variations into the same messages. This paper presents a method for disentangling the causal effects of the various tests under conditions of continuous test adaptation.
arXiv Detail & Related papers (2023-05-02T11:28:12Z)
Active Sequential Two-Sample Testing [18.99517340397671]
We consider the two-sample testing problem in a new scenario where sample measurements are inexpensive to access. We devise the first emphactiveNIST-sample testing framework that not only sequentially but also emphactively queries. In practice, we introduce an instantiation of our framework and evaluate it using several experiments.
arXiv Detail & Related papers (2023-01-30T02:23:49Z)
Sequential Kernelized Independence Testing [101.22966794822084]
We design sequential kernelized independence tests inspired by kernelized dependence measures. We demonstrate the power of our approaches on both simulated and real data.
arXiv Detail & Related papers (2022-12-14T18:08:42Z)
Model-Free Sequential Testing for Conditional Independence via Testing by Betting [8.293345261434943]
The proposed test allows researchers to analyze an incoming i.i.d. data stream with any arbitrary dependency structure. We allow the processing of data points online as soon as they arrive and stop data acquisition once significant results are detected.
arXiv Detail & Related papers (2022-10-01T20:05:33Z)
Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control [67.52000805944924]
Learn then Test (LTT) is a framework for calibrating machine learning models. Our main insight is to reframe the risk-control problem as multiple hypothesis testing. We use our framework to provide new calibration methods for several core machine learning tasks with detailed worked examples in computer vision.
arXiv Detail & Related papers (2021-10-03T17:42:03Z)
Significance tests of feature relevance for a blackbox learner [6.72450543613463]
We derive two consistent tests for the feature relevance of a blackbox learner. The first evaluates a loss difference with perturbation on an inference sample. The second splits the inference sample into two but does not require data perturbation.
arXiv Detail & Related papers (2021-03-02T00:59:19Z)
Double Generative Adversarial Networks for Conditional Independence Testing [8.359770027722275]
High-dimensional conditional independence testing is a key building block in statistics and machine learning. We propose an inferential procedure based on double generative adversarial networks (GANs)
arXiv Detail & Related papers (2020-06-03T16:14:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.