Aggregation as Unsupervised Learning and its Evaluation
- URL: http://arxiv.org/abs/2110.15136v1
- Date: Thu, 28 Oct 2021 14:10:30 GMT
- Title: Aggregation as Unsupervised Learning and its Evaluation
- Authors: Maria Ulan, Welf Löwe, Morgan Ericsson, Anna Wingkvist
- Abstract summary: We present an empirical evaluation framework that allows assessing the proposed approach against other aggregation approaches.
We use regression data sets from the UCI machine learning repository and benchmark several data-agnostic and unsupervised approaches for aggregation.
The benchmark results indicate that our approach outperforms the other data-agnostic and unsupervised aggregation approaches.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Regression uses supervised machine learning to find a model that combines
several independent variables to predict a dependent variable based on ground
truth (labeled) data, i.e., tuples of independent and dependent variables
(labels). Similarly, aggregation also combines several independent variables into
a dependent variable. The dependent variable should preserve properties of the
independent variables, e.g., the ranking or relative distance of the
independent variable tuples, and/or represent a latent ground truth that is a
function of these independent variables. However, ground truth data is not
available for finding the aggregation model. Consequently, aggregation models
are data agnostic or can only be derived with unsupervised machine learning
approaches.
We introduce a novel unsupervised aggregation approach based on intrinsic
properties of unlabeled training data, such as the cumulative probability
distributions of the single independent variables and their mutual
dependencies.
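As an illustration of this idea (a minimal sketch under simplifying assumptions, not the authors' exact model, which additionally exploits mutual dependencies between the variables), one unsupervised aggregate maps each independent variable through its empirical cumulative distribution and averages the resulting scores:

```python
import numpy as np

def empirical_cdf_scores(X):
    """Map each column of X to its empirical CDF values in (0, 1]."""
    n, _ = X.shape
    # Double argsort yields 0-based ranks per column; shift to 1..n and scale.
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    return (ranks + 1) / n

def aggregate(X):
    """Unsupervised aggregation: average the per-variable CDF scores.

    Illustrative only -- a stand-in for the paper's approach, which also
    models dependencies between the independent variables.
    """
    return empirical_cdf_scores(X).mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # 100 tuples of 3 independent variables
y = aggregate(X)               # one aggregated score per tuple, shape (100,)
```

Because the CDF transform is rank-based, the aggregate is invariant to monotone rescaling of each input variable, which is one way an aggregation can preserve the ranking of the input tuples without labels.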
We present an empirical evaluation framework that allows assessing the
proposed approach against other aggregation approaches from two perspectives:
(i) how well the aggregation output represents properties of the input tuples,
and (ii) how well the aggregated output can predict a latent ground truth. To this
end, we use data sets for assessing supervised regression approaches that
contain explicit ground truth labels. However, the ground truth is not used for
deriving the aggregation models, but it allows for the assessment from a
perspective (ii). More specifically, we use regression data sets from the UCI
machine learning repository and benchmark several data-agnostic and
unsupervised approaches for aggregation against ours.
The benchmark results indicate that our approach outperforms the other
data-agnostic and unsupervised aggregation approaches. It is almost on par with
linear regression.
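Perspective (ii) can be sketched as follows, using a synthetic latent ground truth and a plain mean as a stand-in data-agnostic aggregator (illustrative only; the paper benchmarks on UCI regression data sets and its own aggregation model):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                        # independent variables
latent = X.sum(axis=1) + 0.1 * rng.normal(size=200)  # hypothetical latent ground truth

# Data-agnostic baseline: aggregate by the plain mean of the raw variables.
agg = X.mean(axis=1)

# Evaluation from perspective (ii): the ground truth is never used to derive
# the aggregation model, only to score how well the aggregate recovers it.
rho = spearman(agg, latent)
```

In this easy synthetic case the rank correlation is close to 1; on real UCI data sets the gap between such baselines, the proposed approach, and supervised linear regression is exactly what the benchmark measures.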
Related papers
- Testing Independence of Exchangeable Random Variables [19.973896010415977]
Given well-shuffled data, can we determine whether the data items are statistically (in)dependent?
We will show that this is possible and develop tests that can confidently reject the null hypothesis that data is independent and identically distributed.
One potential application is in Deep Learning, where data is often scraped from the whole internet and duplications abound.
arXiv Detail & Related papers (2022-10-22T08:55:48Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - Equivariance and Invariance Inductive Bias for Learning from
Insufficient Data [65.42329520528223]
We show why insufficient data renders the model more easily biased to the limited training environments that are usually different from testing.
We propose a class-wise invariant risk minimization (IRM) that efficiently tackles the challenge of missing environmental annotation in conventional IRM.
arXiv Detail & Related papers (2022-07-25T15:26:19Z) - On the Strong Correlation Between Model Invariance and Generalization [54.812786542023325]
Generalization captures a model's ability to classify unseen data.
Invariance measures consistency of model predictions on transformations of the data.
From a dataset-centric view, we find a certain model's accuracy and invariance linearly correlated on different test sets.
arXiv Detail & Related papers (2022-07-14T17:08:25Z) - Equivariance Allows Handling Multiple Nuisance Variables When Analyzing
Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z) - CARMS: Categorical-Antithetic-REINFORCE Multi-Sample Gradient Estimator [60.799183326613395]
We propose an unbiased estimator for categorical random variables based on multiple mutually negatively correlated (jointly antithetic) samples.
CARMS combines REINFORCE with copula based sampling to avoid duplicate samples and reduce its variance, while keeping the estimator unbiased using importance sampling.
We evaluate CARMS on several benchmark datasets on a generative modeling task, as well as a structured output prediction task, and find it to outperform competing methods including a strong self-control baseline.
arXiv Detail & Related papers (2021-10-26T20:14:30Z) - Bayesian data combination model with Gaussian process latent variable
model for mixed observed variables under NMAR missingness [0.0]
It is difficult to obtain a "(quasi) single-source dataset" in which the variables of interest are simultaneously observed.
It is therefore necessary to combine multiple datasets and treat them as a single-source dataset with missing variables.
We propose a data fusion method that does not assume that datasets are homogeneous.
arXiv Detail & Related papers (2021-09-01T16:09:55Z) - Statistical Estimation from Dependent Data [37.73584699735133]
We consider a general statistical estimation problem wherein binary labels across different observations are not independent conditioned on their feature vectors.
We model these dependencies in the language of Markov Random Fields.
We provide algorithms and statistically efficient estimation rates for this model.
arXiv Detail & Related papers (2021-07-20T21:18:06Z) - Examining and Combating Spurious Features under Distribution Shift [94.31956965507085]
We define and analyze robust and spurious representations using the information-theoretic concept of minimal sufficient statistics.
We prove that even when there is only bias of the input distribution, models can still pick up spurious features from their training data.
Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations.
arXiv Detail & Related papers (2021-06-14T05:39:09Z) - NestedVAE: Isolating Common Factors via Weak Supervision [45.366986365879505]
We identify the connection between the task of bias reduction and that of isolating factors common between domains.
To isolate the common factors we combine the theory of deep latent variable models with information bottleneck theory.
Two outer VAEs with shared weights attempt to reconstruct the input and infer a latent space, whilst a nested VAE attempts to reconstruct the latent representation of one image, from the latent representation of its paired image.
arXiv Detail & Related papers (2020-02-26T15:49:57Z)