Two-sample test based on Self-Organizing Maps
- URL: http://arxiv.org/abs/2212.08960v1
- Date: Sat, 17 Dec 2022 21:35:47 GMT
- Title: Two-sample test based on Self-Organizing Maps
- Authors: Alejandro Álvarez-Ayllón, Manuel Palomo-Duarte, Juan-Manuel Dodero
- Abstract summary: Machine-learning classifiers can be leveraged as a two-sample statistical test.
Self-Organizing Maps are a dimensionality-reduction technique initially devised as a data visualization tool.
Because their original purpose is visualization, they can also offer insights into how two samples differ.
- Score: 68.8204255655161
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine-learning classifiers can be leveraged as a two-sample statistical
test. Suppose each sample is assigned a different label and that a classifier
can obtain a better-than-chance result discriminating them. In this case, we
can infer that both samples originate from different populations. However, many
types of models, such as neural networks, behave as a black box for the user:
they can reject the hypothesis that both samples originate from the same
population, but they offer no insight into how the samples differ.
Self-Organizing Maps are a dimensionality-reduction technique, initially
devised as a data visualization tool, that displays emergent properties and is
also useful for classification tasks. Since they can be used as classifiers,
they can also serve as a two-sample statistical test; and because their
original purpose is visualization, they can additionally offer insight into
how the samples differ.
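The classifier two-sample test described in the abstract can be sketched in a few lines: assign label 0 to one sample and label 1 to the other, train a classifier on half of the labeled data, and check whether its held-out accuracy beats chance. The sketch below is illustrative only, not the authors' implementation: it uses a minimal 1-D Self-Organizing Map as the classifier (each node takes the majority label of its best-matching training points), and the function names `train_som` and `som_two_sample_test` are hypothetical.

```python
import math
import random

def train_som(points, n_nodes=10, epochs=20, seed=0):
    """Minimal 1-D Self-Organizing Map: a line of node weights is pulled
    toward the data, with a learning rate and neighborhood that shrink
    over the epochs."""
    rng = random.Random(seed)
    nodes = [rng.uniform(min(points), max(points)) for _ in range(n_nodes)]
    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)                        # decaying learning rate
        radius = max(1.0, (n_nodes / 2) * (1 - epoch / epochs))  # shrinking neighborhood
        for x in points:
            # best-matching unit: node closest to the input
            bmu = min(range(n_nodes), key=lambda i: abs(nodes[i] - x))
            for i in range(n_nodes):
                h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                nodes[i] += lr * h * (x - nodes[i])
    return nodes

def som_two_sample_test(sample_a, sample_b, seed=0):
    """Classifier two-sample test: label the samples 0/1, fit a SOM on half
    the data, give each node the majority label of its training points, and
    test whether held-out accuracy beats chance (one-sided binomial test)."""
    rng = random.Random(seed)
    data = [(x, 0) for x in sample_a] + [(x, 1) for x in sample_b]
    rng.shuffle(data)
    half = len(data) // 2
    train, test = data[:half], data[half:]
    nodes = train_som([x for x, _ in train], seed=seed)
    # majority label per node, from the training half
    votes = [[0, 0] for _ in nodes]
    for x, lbl in train:
        bmu = min(range(len(nodes)), key=lambda i: abs(nodes[i] - x))
        votes[bmu][lbl] += 1
    node_label = [0 if v[0] >= v[1] else 1 for v in votes]
    # held-out accuracy of the node-label classifier
    correct = sum(
        1 for x, lbl in test
        if node_label[min(range(len(nodes)), key=lambda i: abs(nodes[i] - x))] == lbl
    )
    n = len(test)
    # one-sided p-value: P(X >= correct) for X ~ Binomial(n, 1/2)
    p_value = sum(math.comb(n, k) for k in range(correct, n + 1)) / 2 ** n
    return correct / n, p_value
```

A small p-value lets us reject that both samples come from the same population, and (as the abstract notes) the trained map itself can be inspected: the positions and labels of the nodes show *where* in the input space the two samples separate.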
Related papers
- Sample-Specific Debiasing for Better Image-Text Models [6.301766237907306]
Self-supervised representation learning on image-text data facilitates crucial medical applications, such as image classification, visual grounding, and cross-modal retrieval.
One common approach involves contrasting semantically similar (positive) and dissimilar (negative) pairs of data points.
Drawing negative samples uniformly from the training data set introduces false negatives, i.e., samples that are treated as dissimilar but belong to the same class.
In healthcare data, the underlying class distribution is nonuniform, implying that false negatives occur at a highly variable rate.
arXiv Detail & Related papers (2023-04-25T22:23:41Z) - Active Sequential Two-Sample Testing [18.99517340397671]
We consider the two-sample testing problem in a new scenario where sample measurements are inexpensive to access.
We devise the first active sequential two-sample testing framework, which queries sample measurements not only sequentially but also actively.
In practice, we introduce an instantiation of our framework and evaluate it using several experiments.
arXiv Detail & Related papers (2023-01-30T02:23:49Z) - Estimating Structural Disparities for Face Models [54.062512989859265]
In machine learning, disparity metrics are often defined by measuring the difference in the performance or outcome of a model, across different sub-populations.
We explore performing such analysis on computer vision models trained on human faces, and on tasks such as face attribute prediction and affect estimation.
arXiv Detail & Related papers (2022-04-13T05:30:53Z) - Understanding, Detecting, and Separating Out-of-Distribution Samples and
Adversarial Samples in Text Classification [80.81532239566992]
We compare the two types of anomalies (OOD and Adv samples) with the in-distribution (ID) ones from three aspects.
We find that OOD samples expose their aberration starting from the first layer, while the abnormalities of Adv samples do not emerge until the deeper layers of the model.
We propose a simple method to separate ID, OOD, and Adv samples using the hidden representations and output probabilities of the model.
arXiv Detail & Related papers (2022-04-09T12:11:59Z) - Label efficient two-sample test [39.0914588747459]
Two-sample tests evaluate whether two samples are realizations of the same distribution (the null hypothesis) or two different distributions (the alternative hypothesis)
In this paper, we consider this important variation on the classical two-sample test problem and pose it as a problem of obtaining the labels of only a small number of samples in service of performing a two-sample test.
We devise a label-efficient three-stage framework: firstly, a classifier is trained with samples uniformly labeled to model the posterior probabilities of the labels; secondly, an innovative query scheme dubbed bimodal query is used to query labels
arXiv Detail & Related papers (2021-11-17T01:55:01Z) - Sampling from Arbitrary Functions via PSD Models [55.41644538483948]
We take a two-step approach by first modeling the probability distribution and then sampling from that model.
We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models.
arXiv Detail & Related papers (2021-10-20T12:25:22Z) - Does the dataset meet your expectations? Explaining sample
representation in image data [0.0]
A neural network model is affected adversely by a lack of diversity in training data.
We present a method that identifies and explains such deficiencies.
We then apply the method to examine a dataset of geometric shapes.
arXiv Detail & Related papers (2020-12-06T18:16:28Z) - Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle.
In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize.
Our approach is agnostic to class labels from the training set which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z) - Debiased Contrastive Learning [64.98602526764599]
We develop a debiased contrastive objective that corrects for the sampling of same-label datapoints.
Empirically, the proposed objective consistently outperforms the state-of-the-art for representation learning in vision, language, and reinforcement learning benchmarks.
arXiv Detail & Related papers (2020-07-01T04:25:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.