Capturing patterns of variation unique to a specific dataset
- URL: http://arxiv.org/abs/2104.08157v1
- Date: Fri, 16 Apr 2021 15:07:32 GMT
- Title: Capturing patterns of variation unique to a specific dataset
- Authors: Robin Tu, Alexander H. Foss, Sihai D. Zhao
- Abstract summary: We propose a tuning-free method that identifies low-dimensional representations of a target dataset relative to one or more comparison datasets.
We show in several experiments that UCA with a single background dataset achieves similar results compared to cPCA with various tuning parameters.
- Score: 68.8204255655161
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Capturing patterns of variation present in a dataset is important in
exploratory data analysis and unsupervised learning. Contrastive dimension
reduction methods, such as contrastive principal component analysis (cPCA),
find patterns unique to a target dataset of interest by contrasting with a
carefully chosen background dataset representing unwanted or uninteresting
variation. However, such methods typically require a tuning parameter that
governs the level of contrast, and it is unclear how to choose this parameter
objectively. Furthermore, it is frequently of interest to contrast against
multiple backgrounds, which is difficult to accomplish with existing methods.
We propose unique component analysis (UCA), a tuning-free method that
identifies low-dimensional representations of a target dataset relative to one
or more comparison datasets. It is computationally efficient even with large
numbers of features. We show in several experiments that UCA with a single
background dataset achieves similar results compared to cPCA with various
tuning parameters, and that UCA with multiple individual background datasets is
superior to both cPCA with any single background dataset and cPCA with a pooled
background dataset.
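To make the contrast concrete, here is a minimal numpy sketch of cPCA, the baseline method the abstract contrasts against: it projects the target data onto the top eigenvectors of C_target − α·C_background, where α is exactly the tuning parameter that UCA is designed to eliminate. The function, its defaults, and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cpca(target, background, alpha=1.0, n_components=2):
    """Contrastive PCA sketch: find directions of high target variance
    and low background variance. `alpha` is the contrast parameter
    whose objective choice the abstract notes is unclear."""
    # Empirical covariance matrices (np.cov centers the data itself).
    Ct = np.cov(target, rowvar=False)
    Cb = np.cov(background, rowvar=False)
    # Eigendecomposition of the contrastive covariance C_t - alpha * C_b.
    vals, vecs = np.linalg.eigh(Ct - alpha * Cb)
    order = np.argsort(vals)[::-1]            # largest eigenvalues first
    components = vecs[:, order[:n_components]]
    # Project the centered target data onto the contrastive directions.
    return (target - target.mean(axis=0)) @ components

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
target = rng.normal(size=(100, 5))
background = rng.normal(size=(80, 5))
proj = cpca(target, background, alpha=2.0)
```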
Related papers
- Preference Optimization with Multi-Sample Comparisons [53.02717574375549]
We introduce a novel approach that extends post-training to include multi-sample comparisons.
These approaches fail to capture critical characteristics such as generative diversity and bias.
We demonstrate that multi-sample comparison is more effective in optimizing collective characteristics than single-sample comparison.
arXiv Detail & Related papers (2024-10-16T00:59:19Z)
- RepMatch: Quantifying Cross-Instance Similarities in Representation Space [15.215985417763472]
We introduce RepMatch, a novel method that characterizes data through the lens of similarity.
RepMatch quantifies the similarity between subsets of training instances by comparing the knowledge encoded in models trained on them.
We validate the effectiveness of RepMatch across multiple NLP tasks, datasets, and models.
arXiv Detail & Related papers (2024-10-12T20:42:28Z)
- Quantifying Spuriousness of Biased Datasets Using Partial Information Decomposition [14.82261635235695]
Spurious patterns refer to a mathematical association between two or more variables in a dataset that are not causally related.
This work presents the first information-theoretic formalization of spuriousness in a dataset (given a split of spurious and core features) using a mathematical framework called Partial Information Decomposition (PID).
We disentangle the joint information content that the spurious and core features share about another target variable into distinct components, namely unique, redundant, and synergistic information.
arXiv Detail & Related papers (2024-06-29T16:05:47Z)
- Diversity Measurement and Subset Selection for Instruction Tuning Datasets [40.930387018872786]
We use determinantal point processes to capture the diversity and quality of instruction tuning datasets for subset selection.
We propose to measure dataset diversity with log determinant distance that is the distance between the dataset of interest and a maximally diverse reference dataset.
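A short numpy sketch of the log-determinant diversity idea mentioned above. This is one plausible reading, not the paper's exact construction: a set's DPP-style diversity is the log-determinant of an RBF kernel Gram matrix, and the "log determinant distance" is the gap to a maximally diverse reference set. The kernel choice, `gamma`, the jitter, and both function names are illustrative assumptions.

```python
import numpy as np

def logdet_diversity(X, gamma=1.0, jitter=1e-6):
    """DPP-style diversity score: log-determinant of an RBF kernel
    Gram matrix over the rows of X (higher = more diverse)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distances
    K = np.exp(-gamma * d2) + jitter * np.eye(len(X))  # jitter for stability
    _, logdet = np.linalg.slogdet(K)
    return logdet

def logdet_distance(X, reference, gamma=1.0):
    """Hypothetical 'log determinant distance': gap between a dataset's
    diversity score and that of a maximally diverse reference set."""
    return logdet_diversity(reference, gamma) - logdet_diversity(X, gamma)

# Near-duplicate points score far below well-separated ones.
clustered = np.array([[0.0, 0.0], [0.01, 0.0], [0.0, 0.01], [0.01, 0.01]])
spread = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])
```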
arXiv Detail & Related papers (2024-02-04T02:09:43Z)
- SepVAE: a contrastive VAE to separate pathological patterns from healthy ones [2.619659560375341]
Contrastive Analysis VAEs (CA-VAEs) are a family of variational autoencoders (VAEs) that aim to separate the common factors of variation between a background dataset (BG) and a target dataset (TG).
We show better performance than previous CA-VAE methods on three medical applications and a natural-image dataset (CelebA).
arXiv Detail & Related papers (2023-07-12T14:52:21Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- Auditing for Diversity using Representative Examples [17.016881905579044]
We propose a cost-effective approach to approximate the disparity of a given unlabeled dataset.
Our proposed algorithm uses the pairwise similarity between elements in the dataset and elements in the control set to effectively bootstrap an approximation.
We show that using a control set whose size is much smaller than the size of the dataset is sufficient to achieve a small approximation error.
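The control-set idea above can be sketched in a few lines. This is a hypothetical reading of the approach, not the paper's algorithm: each unlabeled element is assigned the group label of its most similar control element, and group proportions are estimated from those assignments. The similarity measure, the disparity definition (min/max proportion ratio), and all names are illustrative assumptions.

```python
import numpy as np

def approx_disparity(data, control, control_labels):
    """Hypothetical sketch: approximate the group balance of an
    unlabeled dataset by assigning each row to its most cosine-similar
    labeled control row, then comparing estimated group proportions."""
    # Cosine similarity between dataset rows and control rows.
    dn = data / np.linalg.norm(data, axis=1, keepdims=True)
    cn = control / np.linalg.norm(control, axis=1, keepdims=True)
    sim = dn @ cn.T
    assigned = control_labels[np.argmax(sim, axis=1)]
    _, counts = np.unique(assigned, return_counts=True)
    props = counts / len(data)
    # Disparity here: ratio of smallest to largest estimated proportion
    # (1.0 = perfectly balanced).
    return props.min() / props.max()

# Two obvious clusters, one labeled control point per cluster.
data = np.array([[10.0, 0.0], [9.0, 1.0], [0.0, 10.0], [1.0, 9.0]])
control = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1])
balance = approx_disparity(data, control, labels)
```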
arXiv Detail & Related papers (2021-07-15T15:21:17Z)
- Multi-dataset Pretraining: A Unified Model for Semantic Segmentation [97.61605021985062]
We propose a unified framework, termed as Multi-Dataset Pretraining, to take full advantage of the fragmented annotations of different datasets.
This is achieved by first pretraining the network via the proposed pixel-to-prototype contrastive loss over multiple datasets.
In order to better model the relationship among images and classes from different datasets, we extend the pixel level embeddings via cross dataset mixing.
arXiv Detail & Related papers (2021-06-08T06:13:11Z)
- Multi-Source Causal Inference Using Control Variates [81.57072928775509]
We propose a general algorithm to estimate causal effects from multiple data sources.
We show theoretically that this reduces the variance of the ATE estimate.
We apply this framework to inference from observational data under an outcome selection bias.
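The variance-reduction claim above rests on the textbook control-variate adjustment, which is easy to sketch. This is the generic technique, not the paper's multi-source ATE estimator: an auxiliary statistic with known mean is used to shrink the variance of the primary estimate, with the variance-minimizing coefficient c = Cov(primary, aux) / Var(aux). All names and the simulation are illustrative.

```python
import numpy as np

def control_variate_estimate(primary, auxiliary, aux_mean):
    """Generic control-variate adjustment: return the adjusted mean
    estimate and its sample variance, shrinking `primary` using an
    auxiliary statistic whose true mean `aux_mean` is known."""
    cov = np.cov(primary, auxiliary)
    c = cov[0, 1] / cov[1, 1]             # variance-minimizing coefficient
    adjusted = primary - c * (auxiliary - aux_mean)
    return adjusted.mean(), adjusted.var()

# Simulation: the auxiliary variable absorbs most of the noise.
rng = np.random.default_rng(0)
aux = rng.normal(size=10_000)              # known true mean 0
primary = aux + 0.1 * rng.normal(size=10_000)
est, var = control_variate_estimate(primary, aux, 0.0)
```

With highly correlated `primary` and `aux`, the adjusted variance drops well below the raw sample variance of `primary`.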
arXiv Detail & Related papers (2021-03-30T21:20:51Z)
- Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.