Predictive Heterogeneity: Measures and Applications
- URL: http://arxiv.org/abs/2304.00305v1
- Date: Sat, 1 Apr 2023 12:20:06 GMT
- Title: Predictive Heterogeneity: Measures and Applications
- Authors: Jiashuo Liu and Jiayun Wu and Bo Li and Peng Cui
- Abstract summary: We propose the *usable predictive heterogeneity*, which takes into account the model capacity and computational constraints.
We show that it can be reliably estimated from finite data with probably approximately correct (PAC) bounds.
Empirically, the explored heterogeneity provides insights for sub-population divisions in income prediction, crop yield prediction and image classification tasks.
- Score: 26.85283526483783
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As an intrinsic and fundamental property of big data, data heterogeneity
exists in a variety of real-world applications, such as precision medicine,
autonomous driving, and financial applications. For machine learning
algorithms, ignoring data heterogeneity greatly hurts generalization
performance and algorithmic fairness, since the prediction mechanisms of
different sub-populations are likely to differ from each other. In this work,
we focus on the data heterogeneity that affects the prediction of machine
learning models, and first propose the *usable predictive heterogeneity*,
which takes into account the model capacity and
computational constraints. We prove that it can be reliably estimated from
finite data with probably approximately correct (PAC) bounds. Additionally, we
design a bi-level optimization algorithm to explore the usable predictive
heterogeneity from data. Empirically, the explored heterogeneity provides
insights for sub-population divisions in income prediction, crop yield
prediction and image classification tasks, and leveraging such heterogeneity
benefits the out-of-distribution generalization performance.
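The abstract does not spell out the bi-level optimization algorithm, so the following is only a minimal sketch of the general shape of such an exploration procedure: an inner step that fits one predictor per candidate sub-population and an outer step that reassigns each sample to the sub-population whose predictor explains it best. The function name, the choice of logistic regression, and the log-likelihood reassignment rule are illustrative assumptions, not the authors' objective (which is an information-theoretic measure with PAC guarantees).

```python
# Illustrative sketch only (not the paper's algorithm): alternate between
# fitting one predictor per candidate sub-population (inner level) and
# reassigning each sample to the sub-population whose predictor explains it
# best (outer level). Assumes every group sees all classes; not robust code.
import numpy as np
from sklearn.linear_model import LogisticRegression

def explore_heterogeneity(X, y, n_groups=2, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    assign = rng.integers(n_groups, size=len(y))            # random initial division
    for _ in range(n_iters):
        models = [None] * n_groups
        for g in range(n_groups):                            # inner level: fit per group
            mask = assign == g
            if mask.sum() >= 10 and len(np.unique(y[mask])) > 1:
                models[g] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
        scores = np.full((len(y), n_groups), -np.inf)
        for g, m in enumerate(models):                       # outer level: reassign samples
            if m is None:
                continue
            proba = m.predict_proba(X)                       # columns follow m.classes_
            col = np.searchsorted(m.classes_, y)             # column of each true label
            scores[:, g] = np.log(proba[np.arange(len(y)), col] + 1e-12)
        assign = scores.argmax(axis=1)
    return assign, models
```

The returned `assign` plays the role of a sub-population division of the kind the abstract reports for income and crop yield prediction; the paper's actual criterion and guarantees differ from this plain log-likelihood rule.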
Related papers
- Ranking and Combining Latent Structured Predictive Scores without Labeled Data [2.5064967708371553]
This paper introduces a novel structured unsupervised ensemble learning model (SUEL)
It exploits the dependency among a set of predictors with continuous predictive scores, ranks the predictors without labeled data, and combines them into a weighted ensemble score.
The efficacy of the proposed methods is rigorously assessed through both simulation studies and real-world application of risk genes discovery.
arXiv Detail & Related papers (2024-08-14T20:14:42Z)
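The SUEL entry above gives only the high-level idea of ranking and combining continuous predictor scores without labels. As a loose, generic illustration (not the paper's structured model), one common unsupervised-ensemble heuristic weights predictors by the leading eigenvector of their score-correlation matrix:

```python
# Generic unsupervised-ensemble sketch (not SUEL): rank and combine continuous
# predictor scores without labels, using the leading eigenvector of their
# correlation matrix as non-negative combination weights.
import numpy as np

def combine_scores(S):
    """S: (n_samples, n_predictors) matrix of continuous predictive scores."""
    S_std = (S - S.mean(axis=0)) / (S.std(axis=0) + 1e-12)
    corr = np.corrcoef(S_std, rowvar=False)                  # predictor-predictor correlation
    eigvals, eigvecs = np.linalg.eigh(corr)
    w = np.abs(eigvecs[:, -1])                               # leading eigenvector
    w = w / w.sum()                                          # normalise to weights
    ranking = np.argsort(-w)                                 # predictors ranked by weight
    return S_std @ w, w, ranking
```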
- Collaborative Heterogeneous Causal Inference Beyond Meta-analysis [68.4474531911361]
We propose a collaborative inverse propensity score estimator for causal inference with heterogeneous data.
Our method shows significant improvements over the methods based on meta-analysis when heterogeneity increases.
arXiv Detail & Related papers (2024-04-24T09:04:36Z)
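The collaborative inverse propensity score (IPS) estimator above is described only at a high level. The sketch below shows a much simpler multi-source variant, assumed purely for illustration: fit a propensity model per site, form an inverse-propensity-weighted effect estimate per site, and pool the sites with inverse-variance weights.

```python
# Simplified multi-source IPW sketch (not the paper's collaborative estimator):
# one propensity model and one IPW effect estimate per site, pooled by
# inverse-variance weighting.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, t, y):
    """X: covariates, t: 0/1 treatment indicator, y: outcome (numpy arrays)."""
    e = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    e = np.clip(e, 0.05, 0.95)                               # trim extreme propensities
    ate = np.average(y, weights=t / e) - np.average(y, weights=(1 - t) / (1 - e))
    # crude variance proxy for pooling; a real estimator would be more careful
    var = y[t == 1].var() / max((t == 1).sum(), 1) + y[t == 0].var() / max((t == 0).sum(), 1)
    return ate, var

def collaborative_ate(sources):
    """sources: list of (X, t, y) tuples, one per site."""
    estimates = [ipw_ate(X, t, y) for X, t, y in sources]
    ates = np.array([a for a, _ in estimates])
    weights = 1.0 / (np.array([v for _, v in estimates]) + 1e-12)
    return float(np.sum(weights * ates) / np.sum(weights))
```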
- Combining propensity score methods with variational autoencoders for generating synthetic data in presence of latent sub-groups [0.0]
Heterogeneity might be known, e.g., as indicated by sub-groups labels, or might be unknown and reflected only in properties of distributions, such as bimodality or skewness.
We investigate how such heterogeneity can be preserved and controlled when obtaining synthetic data from variational autoencoders (VAEs), i.e., a generative deep learning technique.
arXiv Detail & Related papers (2023-12-12T22:49:24Z)
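The entry above is about preserving sub-group heterogeneity (e.g., bimodality) in VAE-generated synthetic data. One generic way to keep known groups intact, shown below, is a conditional VAE in which encoder and decoder both see a one-hot group label, so synthetic samples can be drawn group by group. This PyTorch sketch is an assumption, not the propensity-score construction the paper proposes, and it only covers the known-group case.

```python
# Minimal conditional VAE sketch (generic; not the paper's approach): the
# group label conditions encoder and decoder, so sampling per group preserves
# known heterogeneity such as bimodality.
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    def __init__(self, x_dim, n_groups, z_dim=8, h_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + n_groups, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + n_groups, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x, g_onehot):
        h = self.enc(torch.cat([x, g_onehot], dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation
        return self.dec(torch.cat([z, g_onehot], dim=1)), mu, logvar

def elbo_loss(x, x_hat, mu, logvar):
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()                  # Gaussian reconstruction
    kl = (-0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)).mean()
    return recon + kl

def sample_per_group(model, g_onehot, z_dim=8):
    with torch.no_grad():
        z = torch.randn(g_onehot.shape[0], z_dim)
        return model.dec(torch.cat([z, g_onehot], dim=1))
```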
- A Federated Learning-based Industrial Health Prognostics for Heterogeneous Edge Devices using Matched Feature Extraction [16.337207503536384]
We propose a pioneering FL-based health prognostic model with a feature similarity-matched parameter aggregation algorithm.
We show that the proposed method yields accuracy improvements as high as 44.5% and 39.3% for state-of-health estimation and remaining useful life estimation, respectively.
arXiv Detail & Related papers (2023-05-13T07:20:31Z)
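The summary of the federated prognostics entry above does not give the matching algorithm itself. The sketch below is a loose, similarity-weighted variant of federated averaging, assumed for illustration only: each client's update is weighted by the cosine similarity of its (flattened) feature-extractor parameters to the current global model, which is not the paper's feature similarity-matched procedure.

```python
# Loose sketch of similarity-weighted federated averaging (not the paper's
# matched-feature aggregation): clients whose feature-extractor weights are
# closer to the global model receive larger aggregation weights.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def aggregate(global_weights, client_weights):
    """global_weights: 1-D array; client_weights: list of 1-D arrays (flattened models)."""
    sims = np.array([max(cosine(global_weights, w), 0.0) for w in client_weights])
    if sims.sum() == 0:                                      # fall back to plain FedAvg
        sims = np.ones(len(client_weights))
    alphas = sims / sims.sum()
    return sum(a * w for a, w in zip(alphas, client_weights))
```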
- Learning to Bound Counterfactual Inference in Structural Causal Models from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
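HyperImpute's implementation and interfaces live in the authors' package; the sketch below is only a generic re-creation of the idea with scikit-learn, not the HyperImpute API: iterate over columns with missing values and, for each column, pick the better of two candidate regressors by cross-validation before imputing.

```python
# Generic iterative imputation sketch with per-column model selection
# (re-creation of the idea with scikit-learn; not the HyperImpute package).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def iterative_impute(X, n_rounds=3):
    X = X.copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])    # mean-initialise
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            X_other = np.delete(X, j, axis=1)
            candidates = [Ridge(), RandomForestRegressor(n_estimators=50)]
            scores = [cross_val_score(m, X_other[~rows], X[~rows, j], cv=3).mean()
                      for m in candidates]
            best = candidates[int(np.argmax(scores))]         # column-wise model selection
            best.fit(X_other[~rows], X[~rows, j])
            X[rows, j] = best.predict(X_other[rows])
    return X
```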
- Scalable Regularised Joint Mixture Models [2.0686407686198263]
In many applications, data can be heterogeneous in the sense of spanning latent groups with different underlying distributions.
We propose an approach for heterogeneous data that allows joint learning of (i) explicit multivariate feature distributions, (ii) high-dimensional regression models and (iii) latent group labels.
The approach is demonstrably effective in high dimensions, combining data reduction for computational efficiency with a re-weighting scheme that retains key signals even when the number of features is large.
arXiv Detail & Related papers (2022-05-03T13:38:58Z)
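A full regularised joint mixture model is beyond a short snippet; the sketch below only shows the basic alternation suggested by the summary above, under assumed simplifications: initialise group responsibilities with a Gaussian mixture on the features, then repeatedly fit one ridge regression per latent group and re-weight samples by how well each group's regression explains them. None of the paper's regularisation or scalability machinery is included.

```python
# Bare-bones joint mixture sketch (no regularisation/scaling tricks from the
# paper): GMM-initialised responsibilities, then per-group ridge regressions
# whose residual likelihoods drive the soft group reassignment.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.mixture import GaussianMixture

def joint_mixture(X, y, n_groups=2, n_iters=5):
    resp = GaussianMixture(n_groups, random_state=0).fit(X).predict_proba(X)
    for _ in range(n_iters):
        models, densities = [], np.zeros((len(y), n_groups))
        for g in range(n_groups):
            m = Ridge().fit(X, y, sample_weight=resp[:, g])       # per-group regression
            resid = y - m.predict(X)
            sigma2 = np.average(resid ** 2, weights=resp[:, g]) + 1e-6
            densities[:, g] = np.exp(-0.5 * resid ** 2 / sigma2) / np.sqrt(sigma2)
            models.append(m)
        resp = densities * resp.mean(axis=0) + 1e-12              # weight by group priors
        resp /= resp.sum(axis=1, keepdims=True)
    return resp.argmax(axis=1), models
```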
- Multimodal Data Fusion in High-Dimensional Heterogeneous Datasets via Generative Models [16.436293069942312]
We are interested in learning probabilistic generative models from high-dimensional heterogeneous data in an unsupervised fashion.
We propose a general framework that combines disparate data types through the exponential family of distributions.
The proposed algorithm is presented in detail for the commonly encountered heterogeneous datasets with real-valued (Gaussian) and categorical (multinomial) features.
arXiv Detail & Related papers (2021-08-27T18:10:31Z)
- General stochastic separation theorems with optimal bounds [68.8204255655161]
The phenomenon of stochastic separability was revealed and used in machine learning to correct errors of Artificial Intelligence (AI) systems and to analyze AI instabilities.
Errors or clusters of errors can be separated from the rest of the data.
The ability to correct an AI system also opens up the possibility of an attack on it, and the high dimensionality induces vulnerabilities caused by the same separability.
arXiv Detail & Related papers (2020-10-11T13:12:41Z)
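The separability phenomenon behind the theorems above is easy to observe numerically: in high dimension, a randomly chosen point of a random sample can almost always be separated from all other points by a simple linear test. The check below is a small illustration of that effect, not a reproduction of the paper's optimal bounds.

```python
# Numerical illustration of stochastic separability (not the paper's bounds):
# in high dimension, a random point x is almost always separated from the rest
# of the sample by the simple linear test <x_j, x> < <x, x>.
import numpy as np

def separability_rate(n=1000, dim=200, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, size=(n, dim))                    # i.i.d. sample in a cube
    hits = 0
    for i in rng.integers(n, size=trials):
        x = X[i]
        others = np.delete(X, i, axis=0)
        hits += int(np.all(others @ x < x @ x))              # linear separation test
    return hits / trials

print(separability_rate(dim=10), separability_rate(dim=200))  # rate grows with dimension
```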
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
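The consistent misclassification estimator in the entry above relies on random-matrix asymptotics that do not fit a snippet. The code below only reproduces the basic ensemble construction it analyses: many linear discriminant classifiers, each fit on a different random projection of the data, combined by majority vote, with a plain held-out error estimate standing in for the paper's estimator.

```python
# Ensemble of randomly projected linear discriminants (basic construction only;
# the error is estimated on a held-out split, not with the paper's asymptotic
# estimator). Assumes integer class labels.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

def rp_lda_ensemble(X, y, n_members=25, proj_dim=10, seed=0):
    rng = np.random.default_rng(seed)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=seed)
    members = []
    for _ in range(n_members):
        R = rng.normal(size=(X.shape[1], proj_dim)) / np.sqrt(proj_dim)  # random projection
        members.append((R, LinearDiscriminantAnalysis().fit(Xtr @ R, ytr)))
    votes = np.array([clf.predict(Xte @ R) for R, clf in members])
    maj = np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
    error = float(np.mean(maj != yte))                        # held-out error estimate
    return members, error
```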
- Learning Overlapping Representations for the Estimation of Individualized Treatment Effects [97.42686600929211]
Estimating the likely outcome of alternatives from observational data is a challenging problem.
We show that algorithms that learn domain-invariant representations of inputs are often inappropriate.
We develop a deep kernel regression algorithm and posterior regularization framework that substantially outperforms the state-of-the-art on a variety of benchmarks data sets.
arXiv Detail & Related papers (2020-01-14T12:56:29Z)