Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data
- URL: http://arxiv.org/abs/2210.13043v1
- Date: Mon, 24 Oct 2022 08:57:55 GMT
- Title: Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data
- Authors: Nabeel Seedat, Jonathan Crabbé, Ioana Bica, Mihaela van der Schaar
- Abstract summary: We propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes.
We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets.
- Score: 81.43750358586072
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High model performance, on average, can hide that models may systematically
underperform on subgroups of the data. We consider the tabular setting, which
surfaces the unique issue of outcome heterogeneity - this is prevalent in areas
such as healthcare, where patients with similar features can have different
outcomes, thus making reliable predictions challenging. To tackle this, we
propose Data-IQ, a framework to systematically stratify examples into subgroups
with respect to their outcomes. We do this by analyzing the behavior of
individual examples during training, based on their predictive confidence and,
importantly, the aleatoric (data) uncertainty. Capturing the aleatoric
uncertainty permits a principled characterization and then subsequent
stratification of data examples into three distinct subgroups (Easy, Ambiguous,
Hard). We experimentally demonstrate the benefits of Data-IQ on four real-world
medical datasets. We show that Data-IQ's characterization of examples is most
robust to variation across similarly performant (yet different) models,
compared to baselines. Since Data-IQ can be used with any ML model (including
neural networks, gradient boosting etc.), this property ensures consistency of
data characterization, while allowing flexible model selection. Taking this a
step further, we demonstrate that the subgroups enable us to construct new
approaches to both feature acquisition and dataset selection. Furthermore, we
highlight how the subgroups can inform reliable model usage, noting the
significant impact of the Ambiguous subgroup on model generalization.
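The abstract's stratification idea can be illustrated with a minimal sketch: track each example's predicted probability of its true class across training checkpoints, average it as confidence, estimate aleatoric uncertainty via the mean of p(1-p), and threshold both into Easy/Ambiguous/Hard groups. The thresholds and the checkpoint-averaging scheme here are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def stratify_data_iq(checkpoint_probs, conf_thresh=0.75, unc_thresh=0.2):
    """Stratify examples into Easy/Ambiguous/Hard groups.

    checkpoint_probs: array of shape (n_checkpoints, n_examples) holding the
    model's predicted probability of each example's *true* class at each
    training checkpoint.
    """
    probs = np.asarray(checkpoint_probs, dtype=float)
    # Average confidence in the true class over training.
    confidence = probs.mean(axis=0)
    # Average p * (1 - p): a simple proxy for aleatoric (data) uncertainty.
    aleatoric = (probs * (1.0 - probs)).mean(axis=0)

    groups = np.full(probs.shape[1], "Ambiguous", dtype=object)
    groups[(confidence >= conf_thresh) & (aleatoric < unc_thresh)] = "Easy"
    groups[(confidence <= 1 - conf_thresh) & (aleatoric < unc_thresh)] = "Hard"
    return confidence, aleatoric, groups
```

Because the inputs are just per-checkpoint probabilities, the same sketch applies to any model that exposes them during training, be it a neural network or gradient boosting, which mirrors the model-agnosticism the abstract claims.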
Related papers
- Comparative Analysis of Data Preprocessing Methods, Feature Selection Techniques and Machine Learning Models for Improved Classification and Regression Performance on Imbalanced Genetic Data [0.0]
We investigated the effects of data preprocessing, feature selection techniques, and model selection on the performance of models trained on genetic datasets.
We found that outliers/skew in predictor or target variables did not pose a challenge to regression models.
We also found that class-imbalanced target variables and skewed predictors had little to no impact on classification performance.
arXiv Detail & Related papers (2024-02-22T21:41:27Z)
- Combining propensity score methods with variational autoencoders for generating synthetic data in presence of latent sub-groups [0.0]
Heterogeneity might be known, e.g., as indicated by sub-groups labels, or might be unknown and reflected only in properties of distributions, such as bimodality or skewness.
We investigate how such heterogeneity can be preserved and controlled when obtaining synthetic data from variational autoencoders (VAEs), i.e., a generative deep learning technique.
arXiv Detail & Related papers (2023-12-12T22:49:24Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
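The paper's exact optimization is not given in this summary, but the reweighting idea can be sketched in a simple form: assign each example a weight so that a binary spurious feature and the label become statistically independent under the weighted distribution. The weighting rule below (cell weight = product of marginals / cell frequency) is a standard illustrative construction, not the paper's method.

```python
import numpy as np

def decorrelating_weights(spurious, y):
    """Per-example weights under which a binary spurious feature and a
    binary label are independent, i.e. their weighted correlation is zero.

    Each (feature, label) cell is weighted so its weighted frequency equals
    the product of the two marginal frequencies.
    """
    spurious = np.asarray(spurious)
    y = np.asarray(y)
    w = np.empty(len(y), dtype=float)
    for s in (0, 1):
        for c in (0, 1):
            mask = (spurious == s) & (y == c)
            p_cell = mask.mean()                            # observed joint freq
            target = (spurious == s).mean() * (y == c).mean()  # independent freq
            w[mask] = target / p_cell if p_cell > 0 else 0.0
    return w
```

Note this decorrelates the training *data*; the summary's finding is precisely that models trained on such reweighted data can still exhibit the bias.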
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- Composite Feature Selection using Deep Ensembles [130.72015919510605]
We investigate the problem of discovering groups of predictive features without predefined grouping.
We introduce a novel deep learning architecture that uses an ensemble of feature selection models to find predictive groups.
We propose a new metric to measure similarity between discovered groups and the ground truth.
arXiv Detail & Related papers (2022-11-01T17:49:40Z)
- Scalable Regularised Joint Mixture Models [2.0686407686198263]
In many applications, data can be heterogeneous in the sense of spanning latent groups with different underlying distributions.
We propose an approach for heterogeneous data that allows joint learning of (i) explicit multivariate feature distributions, (ii) high-dimensional regression models and (iii) latent group labels.
The approach is demonstrably effective in high dimensions, combining data reduction for computational efficiency with a re-weighting scheme that retains key signals even when the number of features is large.
arXiv Detail & Related papers (2022-05-03T13:38:58Z)
- Unsupervised Probabilistic Models for Sequential Electronic Health Records [3.8015092217142223]
The model consists of a layered set of latent variables that encode underlying structure in the data.
We train this model on episodic data from subjects receiving medical care in the Kaiser Permanente Northern California integrated healthcare delivery system.
The resulting properties of the trained model generate novel insight from these complex and multifaceted data.
arXiv Detail & Related papers (2022-04-15T02:11:44Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
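Of the two strategies named above, random oversampling is the simpler to sketch: minority-class examples are duplicated at random until every class matches the size of the largest one. This is a generic illustration, not the selection strategy the paper studies.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate minority-class examples at random until every class has as
    many examples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        idx = [i for i, label in enumerate(y) if label == cls]
        for _ in range(target - n):
            j = rng.choice(idx)          # resample an existing minority example
            X_out.append(X[j])
            y_out.append(y[j])
    return X_out, y_out
```

Undersampling is the mirror image (randomly dropping majority-class examples down to the minority size); which of the two works better is exactly the dataset-property question the paper addresses.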
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences arising from its use.