Machine Learning for Two-Sample Testing under Right-Censored Data: A Simulation Study
- URL: http://arxiv.org/abs/2409.08201v2
- Date: Thu, 26 Sep 2024 14:56:57 GMT
- Title: Machine Learning for Two-Sample Testing under Right-Censored Data: A Simulation Study
- Authors: Petr Philonenko, Sergey Postovalov
- Abstract summary: This study evaluates the effectiveness of Machine Learning (ML) methods for two-sample testing with right-censored observations.
In total, this work covers 18 methods for two-sample testing under right-censored observations.
To address the two-sample problem with right-censored observations, one can use the proposed two-sample methods (scripts, dataset, and models are available on GitHub and Hugging Face).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The focus of this study is to evaluate the effectiveness of Machine Learning (ML) methods for two-sample testing with right-censored observations. To achieve this, we develop several ML-based methods with varying architectures and implement them as two-sample tests. Each method is an ensemble (stacking) that combines predictions from classical two-sample tests. This paper presents the results of training the proposed ML methods, examines their statistical power compared to classical two-sample tests, analyzes the null distribution of the proposed methods when the null hypothesis is true, and evaluates the significance of the features incorporated into the proposed methods. In total, this work covers 18 methods for two-sample testing under right-censored observations, including the proposed methods and classical, well-studied two-sample tests. All results from numerical experiments were obtained from a synthetic dataset generated using the inverse transform sampling method and replicated multiple times through Monte Carlo simulation. To address the two-sample problem with right-censored observations, one can use the proposed two-sample methods (scripts, dataset, and models are available on GitHub and Hugging Face).
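The pipeline the abstract describes can be made concrete with a short sketch. The Python snippet below is an illustrative assumption of how such a stacking test could be assembled, not the authors' released code (that code is on GitHub and Hugging Face): synthetic right-censored samples are drawn by inverse transform sampling, two classical statistics (a log-rank and a Gehan-weighted log-rank, chosen here as examples) serve as features, and a gradient-boosting classifier plays the role of the stacking ensemble whose null distribution is then tabulated by Monte Carlo simulation. All distributions, censoring rates, sample sizes, and helper names below are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

def weibull_inverse_transform(n, shape, scale=1.0):
    # Inverse transform sampling: F^{-1}(u) = scale * (-log(1 - u)) ** (1 / shape)
    u = rng.uniform(size=n)
    return scale * (-np.log1p(-u)) ** (1.0 / shape)

def right_censor(lifetimes, cens_mean):
    # Independent exponential right-censoring; event = 1 if the lifetime is observed
    c = rng.exponential(scale=cens_mean, size=lifetimes.shape)
    return np.minimum(lifetimes, c), (lifetimes <= c).astype(float)

def weighted_logrank(t1, d1, t2, d2, gehan=False):
    # Standardised (weighted) log-rank statistic for two right-censored samples,
    # assuming continuous lifetimes (no ties). gehan=True uses risk-set-size weights.
    t = np.concatenate([t1, t2])
    d = np.concatenate([d1, d2])
    g = np.concatenate([np.zeros(len(t1)), np.ones(len(t2))])
    order = np.argsort(t)
    d, g = d[order], g[order]
    n = len(t)
    y_total = n - np.arange(n)               # subjects at risk at each ordered time
    y_group2 = np.cumsum(g[::-1])[::-1]      # group-2 subjects at risk
    p = y_group2 / y_total                   # expected share of an event falling in group 2
    w = y_total if gehan else np.ones(n)
    ev = d == 1                              # positions of observed events
    num = np.sum(w[ev] * (g[ev] - p[ev]))    # weighted observed-minus-expected
    var = np.sum(w[ev] ** 2 * p[ev] * (1.0 - p[ev]))
    return num / np.sqrt(var) if var > 0 else 0.0

def one_replication(label=None, n=100):
    # One Monte Carlo replication: H0 uses identical Weibull laws, H1 a shape alternative.
    if label is None:
        label = int(rng.integers(2))
    shape2 = 1.0 if label == 0 else 1.5
    x1, e1 = right_censor(weibull_inverse_transform(n, shape=1.0), cens_mean=2.0)
    x2, e2 = right_censor(weibull_inverse_transform(n, shape=shape2), cens_mean=2.0)
    features = [abs(weighted_logrank(x1, e1, x2, e2)),
                abs(weighted_logrank(x1, e1, x2, e2, gehan=True))]
    return features, label

# "Stacking": a classifier learns how to combine the classical statistics.
X, y = zip(*(one_replication() for _ in range(5000)))
model = GradientBoostingClassifier().fit(np.array(X), np.array(y))

# The predicted probability of H1 is the ensemble test statistic; its null
# distribution (and the critical value) is tabulated by simulating under H0 only.
null_scores = [model.predict_proba([one_replication(label=0)[0]])[0, 1]
               for _ in range(2000)]
critical_value = np.quantile(null_scores, 0.95)   # significance level alpha = 0.05
print(f"Monte Carlo critical value at alpha=0.05: {critical_value:.3f}")
```

At test time, the same two statistics would be computed for the observed pair of samples and the null hypothesis rejected when the model's predicted probability exceeds critical_value.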
Related papers
- Comparing Generative Models with the New Physics Learning Machine [0.0]
In large-scale and high-dimensional regimes, machine learning offers a set of tools to push beyond the limitations of standard statistical techniques. We put this claim to the test using a proposal from the high-energy physics literature, the New Physics Learning Machine, to perform a classification-based two-sample test (a generic sketch of such a classification-based test appears after this list). We highlight the efficiency tradeoffs of the method and the computational costs that come from adopting learning-based approaches.
arXiv Detail & Related papers (2025-08-04T10:42:52Z) - Two-cluster test [1.871954330708119]
We introduce the two-cluster test problem and argue that it is fundamentally different from the conventional two-sample testing problem. Experiments on both synthetic and real data sets show that the proposed test significantly reduces the Type-I error rate. More importantly, the practical utility of the two-cluster test is further verified through its applications in tree-based interpretable clustering and significance-based hierarchical clustering.
arXiv Detail & Related papers (2025-07-11T07:54:16Z) - A Kernel-Based Conditional Two-Sample Test Using Nearest Neighbors (with Applications to Calibration, Regression Curves, and Simulation-Based Inference) [3.622435665395788]
We introduce a kernel-based measure for detecting differences between two conditional distributions.
When the two conditional distributions are the same, the estimate has a Gaussian limit and its variance has a simple form that can be easily estimated from the data.
We also provide a resampling based test using our estimate that applies to the conditional goodness-of-fit problem.
arXiv Detail & Related papers (2024-07-23T15:04:38Z) - CKD: Contrastive Knowledge Distillation from A Sample-wise Perspective [48.99488315273868]
We present a contrastive knowledge distillation approach, which can be formulated as a sample-wise alignment problem with intra- and inter-sample constraints.
Our method minimizes logit differences within the same sample by considering their numerical values.
We conduct comprehensive experiments on three datasets including CIFAR-100, ImageNet-1K, and MS COCO.
arXiv Detail & Related papers (2024-04-22T11:52:40Z) - Variable Selection in Maximum Mean Discrepancy for Interpretable Distribution Comparison [9.12501922682336]
Two-sample testing decides whether two datasets are generated from the same distribution.
This paper studies variable selection for two-sample testing, the task being to identify the variables responsible for the discrepancies between the two distributions.
arXiv Detail & Related papers (2023-11-02T18:38:39Z) - A framework for paired-sample hypothesis testing for high-dimensional data [7.400168551191579]
We put forward the idea that scoring functions can be produced by the decision rules defined by the bisecting hyperplanes of the line segments connecting each pair of instances.
First, we estimate the bisecting hyperplanes for each pair of instances and an aggregated rule derived through the Hodges-Lehmann estimator.
arXiv Detail & Related papers (2023-09-28T09:17:11Z) - Structured Voronoi Sampling [61.629198273926676]
In this paper, we take an important step toward building a principled approach for sampling from language models with gradient-based methods.
We name our gradient-based technique Structured Voronoi Sampling (SVS).
In a controlled generation task, SVS is able to generate fluent and diverse samples while following the control targets significantly better than other methods.
arXiv Detail & Related papers (2023-06-05T17:32:35Z) - E-Valuating Classifier Two-Sample Tests [11.248868528186332]
Our test combines ideas from existing work on anytime-valid split likelihood ratio tests and predictive independence tests.
The resulting E-values are suitable for anytime sequential two-sample tests.
arXiv Detail & Related papers (2022-10-24T08:18:36Z) - Sampling from Arbitrary Functions via PSD Models [55.41644538483948]
We take a two-step approach by first modeling the probability distribution and then sampling from that model.
We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models.
arXiv Detail & Related papers (2021-10-20T12:25:22Z) - Empowering Language Understanding with Counterfactual Reasoning [141.48592718583245]
We propose a Counterfactual Reasoning Model, which mimics counterfactual thinking by learning from a small number of counterfactual samples.
In particular, we devise a generation module to generate representative counterfactual samples for each factual sample, and a retrospective module to retrospect the model prediction by comparing the counterfactual and factual samples.
arXiv Detail & Related papers (2021-06-06T06:36:52Z) - Doubly Contrastive Deep Clustering [135.7001508427597]
We present a novel Doubly Contrastive Deep Clustering (DCDC) framework, which constructs contrastive loss over both sample and class views.
Specifically, for the sample view, we set the class distribution of the original sample and its augmented version as positive sample pairs.
For the class view, we build the positive and negative pairs from the sample distribution of the class.
In this way, two contrastive losses successfully constrain the clustering results of mini-batch samples in both sample and class level.
arXiv Detail & Related papers (2021-03-09T15:15:32Z) - Two-Sample Testing on Ranked Preference Data and the Role of Modeling Assumptions [57.77347280992548]
In this paper, we design two-sample tests for pairwise comparison data and ranking data.
Our test requires essentially no assumptions on the distributions.
By applying our two-sample test on real-world pairwise comparison data, we conclude that ratings and rankings provided by people are indeed distributed differently.
arXiv Detail & Related papers (2020-06-21T20:51:09Z) - On Contrastive Learning for Likelihood-free Inference [20.49671736540948]
Likelihood-free methods perform parameter inference in simulator models where evaluating the likelihood is intractable.
One class of methods for this likelihood-free problem uses a classifier to distinguish between pairs of parameter-observation samples.
Another popular class of methods fits a conditional distribution to the parameter posterior directly, and a particular recent variant allows for the use of flexible neural density estimators.
arXiv Detail & Related papers (2020-02-10T13:14:01Z) - Clustering Binary Data by Application of Combinatorial Optimization Heuristics [52.77024349608834]
We study clustering methods for binary data, first defining aggregation criteria that measure the compactness of clusters.
Five new and original methods are introduced, using neighborhoods and population behavior optimization metaheuristics.
From a set of 16 data tables generated by a quasi-Monte Carlo experiment, a comparison is performed for one of the aggregations using L1 dissimilarity, with hierarchical clustering, and a version of k-means: partitioning around medoids or PAM.
arXiv Detail & Related papers (2020-01-06T23:33:31Z)
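For readers unfamiliar with the classification-based two-sample tests mentioned in the New Physics Learning Machine entry above, the following is a generic sketch of the idea, not that paper's method: a classifier is trained to distinguish the two samples, and under the null hypothesis its held-out accuracy should not exceed chance, which a binomial test quantifies. The logistic-regression model, the 50/50 split, and the Gaussian toy data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def classifier_two_sample_test(x, y, seed=0):
    """Return the p-value of a generic classifier two-sample test for samples x and y."""
    data = np.vstack([x, y])
    labels = np.concatenate([np.zeros(len(x)), np.ones(len(y))])
    x_tr, x_te, y_tr, y_te = train_test_split(
        data, labels, test_size=0.5, random_state=seed, stratify=labels)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    correct = int((clf.predict(x_te) == y_te).sum())
    # Under H0 each held-out prediction is correct with probability 1/2.
    return binomtest(correct, n=len(y_te), p=0.5, alternative="greater").pvalue

rng = np.random.default_rng(1)
same = classifier_two_sample_test(rng.normal(size=(500, 5)),
                                  rng.normal(size=(500, 5)))
shifted = classifier_two_sample_test(rng.normal(size=(500, 5)),
                                     rng.normal(0.3, 1.0, size=(500, 5)))
print(f"p-value (same distribution): {same:.3f}, p-value (mean shift): {shifted:.3g}")
```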