An Epistemic and Aleatoric Decomposition of Arbitrariness to Constrain the Set of Good Models
- URL: http://arxiv.org/abs/2302.04525v2
- Date: Sat, 12 Jul 2025 07:10:35 GMT
- Title: An Epistemic and Aleatoric Decomposition of Arbitrariness to Constrain the Set of Good Models
- Authors: Falaah Arif Khan, Denys Herasymuk, Nazar Protsiv, Julia Stoyanovich,
- Abstract summary: Recent research reveals that machine learning (ML) models are highly sensitive to minor changes in their training procedure. We show that stability decomposes into epistemic and aleatoric components, capturing the consistency and confidence in prediction. We propose a model selection procedure that includes epistemic and aleatoric criteria alongside existing accuracy and fairness criteria, and show that it successfully narrows down a large set of good models.
- Score: 7.620967781722717
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent research reveals that machine learning (ML) models are highly sensitive to minor changes in their training procedure, such as the inclusion or exclusion of a single data point, leading to conflicting predictions on individual data points; a property termed arbitrariness or instability in ML pipelines in prior work. Drawing from the uncertainty literature, we show that stability decomposes into epistemic and aleatoric components, capturing the consistency and confidence in prediction, respectively. We use this decomposition to provide two main contributions. Our first contribution is an extensive empirical evaluation. We find that (i) epistemic instability can be reduced with more training data, whereas aleatoric instability cannot; (ii) state-of-the-art ML models have aleatoric instability as high as 79% and aleatoric instability disparities among demographic groups as high as 29% in popular fairness benchmarks; and (iii) fairness pre-processing interventions generally increase aleatoric instability more than in-processing interventions, and both epistemic and aleatoric instability are highly sensitive to data-processing interventions and model architecture. Our second contribution is a practical solution to the problem of systematic arbitrariness. We propose a model selection procedure that includes epistemic and aleatoric criteria alongside existing accuracy and fairness criteria, and show that it successfully narrows down a large set of good models (50-100 on our datasets) to a handful of stable, fair and accurate ones. We built and publicly released a Python library to measure epistemic and aleatoric multiplicity in any ML pipeline alongside existing confusion-matrix-based metrics, providing practitioners with a rich suite of evaluation metrics for defining a more precise criterion during model selection.
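As a rough illustration of the decomposition described above, the following sketch (not the authors' released library) trains a bootstrap ensemble and reports per-sample disagreement with the majority vote as an epistemic-instability proxy and lack of ensemble confidence as an aleatoric-instability proxy; the paper's formal definitions may differ.

```python
# Minimal sketch (not the authors' library): estimate per-sample epistemic and
# aleatoric instability over a bootstrap ensemble of classifiers.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

n_models = 50
probs = np.zeros((n_models, len(X_te)))            # P(y=1 | x) per bootstrap model
for m in range(n_models):
    Xb, yb = resample(X_tr, y_tr, random_state=m)  # bootstrap the training set
    clf = LogisticRegression(max_iter=1000).fit(Xb, yb)
    probs[m] = clf.predict_proba(X_te)[:, 1]

labels = (probs >= 0.5).astype(int)                # hard predictions per model
majority = (labels.mean(axis=0) >= 0.5).astype(int)

# Epistemic instability proxy: how often a model disagrees with the ensemble majority.
epistemic = (labels != majority).mean(axis=0)
# Aleatoric instability proxy: how unconfident the average prediction is (0 = certain).
mean_prob = probs.mean(axis=0)
aleatoric = 1.0 - np.abs(2.0 * mean_prob - 1.0)

print(f"mean epistemic instability: {epistemic.mean():.3f}")
print(f"mean aleatoric instability: {aleatoric.mean():.3f}")
```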
Related papers
- CLEAR: Calibrated Learning for Epistemic and Aleatoric Risk [7.755784217796677]
We propose CLEAR, a calibration method with two distinct parameters. We show how it can be used with (i) quantile regression for aleatoric uncertainty and (ii) ensembles drawn from the Predictability-Computability-Stability framework. CLEAR achieves an average improvement of 28.2% and 17.4% in interval width compared to the two individually calibrated baselines.
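CLEAR's own procedure is not reproduced here; the sketch below only illustrates the general shape of the idea, calibrating two separate scale factors (one for an aleatoric width, one for an epistemic width) so that the combined interval reaches a target coverage on held-out data. All names, the synthetic data, and the grid search are illustrative assumptions.

```python
# Illustrative sketch only (not the CLEAR algorithm itself): calibrate two
# scale factors, one for aleatoric width and one for epistemic width, so the
# combined interval reaches a target coverage on a calibration set.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y_true = rng.normal(size=n)
y_pred = y_true + rng.normal(scale=0.3, size=n)
aleatoric_w = np.abs(rng.normal(scale=0.5, size=n)) + 0.1   # e.g. quantile-regression width
epistemic_w = np.abs(rng.normal(scale=0.2, size=n)) + 0.05  # e.g. ensemble std

def coverage(lam_a, lam_e):
    half = lam_a * aleatoric_w + lam_e * epistemic_w
    return np.mean(np.abs(y_true - y_pred) <= half)

target = 0.90
best = None
for lam_a in np.linspace(0.0, 3.0, 31):
    for lam_e in np.linspace(0.0, 3.0, 31):
        if coverage(lam_a, lam_e) >= target:
            width = np.mean(lam_a * aleatoric_w + lam_e * epistemic_w)
            if best is None or width < best[0]:
                best = (width, lam_a, lam_e)

print("mean half-width, lambda_aleatoric, lambda_epistemic:", best)
```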
arXiv Detail & Related papers (2025-07-10T20:13:00Z) - Why Machine Learning Models Fail to Fully Capture Epistemic Uncertainty [1.6112718683989882]
We make use of a more fine-grained taxonomy of epistemic uncertainty sources in machine learning models. We show that high model bias can lead to misleadingly low estimates of epistemic uncertainty. Common second-order uncertainty methods systematically blur bias-induced errors into aleatoric estimates.
arXiv Detail & Related papers (2025-05-29T14:50:46Z) - Improving Omics-Based Classification: The Role of Feature Selection and Synthetic Data Generation [0.18846515534317262]
This study presents a machine-learning-based classification framework that integrates feature selection with data augmentation techniques. We show that the proposed pipeline yields improved cross-validated performance on small datasets.
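A minimal sklearn-style sketch of this kind of pipeline; the univariate selector and the Gaussian-noise oversampler below are placeholder choices, not the paper's.

```python
# Minimal sketch of a feature-selection + synthetic-augmentation pipeline
# (the selector and the noise-based oversampler are placeholders).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small, wide "omics-like" synthetic dataset.
X, y = make_classification(n_samples=150, n_features=500, n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

def augment(X, y, n_copies=2, noise=0.05, seed=0):
    """Naive synthetic generation: jitter training samples with Gaussian noise."""
    rng = np.random.default_rng(seed)
    Xs, ys = [X], [y]
    for _ in range(n_copies):
        Xs.append(X + rng.normal(scale=noise * X.std(axis=0), size=X.shape))
        ys.append(y)
    return np.vstack(Xs), np.concatenate(ys)

# Augment only the training split, then select features on the training split.
X_aug, y_aug = augment(X_tr, y_tr)
selector = SelectKBest(f_classif, k=50).fit(X_aug, y_aug)
clf = LogisticRegression(max_iter=2000).fit(selector.transform(X_aug), y_aug)
print("held-out accuracy:", clf.score(selector.transform(X_te), y_te))
```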
arXiv Detail & Related papers (2025-05-06T10:09:50Z) - Whence Is A Model Fair? Fixing Fairness Bugs via Propensity Score Matching [0.49157446832511503]
We investigate whether the way training and testing data are sampled affects the reliability of fairness metrics.
Since training and test sets are often randomly sampled from the same population, bias present in the training data may still exist in the test data.
We propose FairMatch, a post-processing method that applies propensity score matching to evaluate and mitigate bias.
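A generic sketch of propensity score matching for fairness evaluation (not the FairMatch implementation itself): match test samples across groups on an estimated propensity score, then compare error rates on the matched pairs.

```python
# Generic sketch: match samples across groups on a propensity score before
# comparing error rates. Data and model are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, size=n)                     # protected attribute
X = rng.normal(size=(n, 5)) + group[:, None] * 0.5     # covariates shifted by group
y_true = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
y_pred = (X[:, 0] > 0.1).astype(int)                   # some model's predictions

# Propensity of belonging to group 1 given the covariates.
prop = LogisticRegression(max_iter=1000).fit(X, group).predict_proba(X)[:, 1]

idx1, idx0 = np.where(group == 1)[0], np.where(group == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(prop[idx0, None])
_, match = nn.kneighbors(prop[idx1, None])             # nearest group-0 match per group-1 sample
matched0 = idx0[match[:, 0]]

err1 = np.mean(y_pred[idx1] != y_true[idx1])
err0 = np.mean(y_pred[matched0] != y_true[matched0])
print(f"error-rate gap on matched samples: {abs(err1 - err0):.3f}")
```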
arXiv Detail & Related papers (2025-04-23T19:28:30Z) - Revisiting the Dataset Bias Problem from a Statistical Perspective [72.94990819287551]
We study the "dataset bias" problem from a statistical standpoint.
We identify the main cause of the problem as the strong correlation between a class attribute u and a non-class attribute b.
We propose to mitigate dataset bias via either weighting the objective of each sample n by 1/p(u_n | b_n) or sampling that sample with a weight proportional to 1/p(u_n | b_n).
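A short sketch of the weighting idea, assuming p(u | b) is estimated with a simple classifier; the paper's estimator and training objective are not reproduced here.

```python
# Sketch: estimate p(u | b) with a classifier over the non-class attribute b,
# then weight each sample n by 1 / p(u_n | b_n) to decorrelate u and b.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
b = rng.integers(0, 2, size=n)                         # non-class (bias) attribute
u = (rng.random(n) < 0.2 + 0.6 * b).astype(int)        # class attribute correlated with b

clf = LogisticRegression().fit(b.reshape(-1, 1), u)
p_u_given_b = clf.predict_proba(b.reshape(-1, 1))      # shape (n, 2)
p_n = p_u_given_b[np.arange(n), u]                     # p(u_n | b_n)
weights = 1.0 / np.clip(p_n, 1e-6, None)               # per-sample weights

# After reweighting, u looks (nearly) independent of b:
for bv in (0, 1):
    mask = b == bv
    print(f"b={bv}: raw P(u=1)={u[mask].mean():.2f}, "
          f"weighted P(u=1)={np.average(u[mask], weights=weights[mask]):.2f}")
```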
arXiv Detail & Related papers (2024-02-05T22:58:06Z) - Likelihood Ratio Confidence Sets for Sequential Decision Making [51.66638486226482]
We revisit the likelihood-based inference principle and propose to use likelihood ratios to construct valid confidence sequences.
Our method is especially suitable for problems with well-specified likelihoods.
We show how to provably choose the best sequence of estimators and shed light on connections to online convex optimization.
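A toy Bernoulli instance of a likelihood-ratio confidence sequence, using a predictable running-mean plug-in in the numerator; the paper's choice of estimator sequence is more refined, so treat this only as an illustration of the computation.

```python
# Toy sketch: keep a candidate theta in the confidence set while the likelihood
# ratio of a predictable plug-in forecaster against theta stays below 1/alpha.
import numpy as np

rng = np.random.default_rng(0)
theta_true, alpha, T = 0.3, 0.05, 500
x = (rng.random(T) < theta_true).astype(int)

grid = np.linspace(0.001, 0.999, 999)
log_num = 0.0                     # log-likelihood of the plug-in predictions
log_den = np.zeros_like(grid)     # log-likelihood of each candidate theta
succ = 0
for t, xt in enumerate(x, start=1):
    theta_hat = (succ + 1) / (t + 1)           # predictable: uses data up to t-1 only
    log_num += xt * np.log(theta_hat) + (1 - xt) * np.log(1 - theta_hat)
    log_den += xt * np.log(grid) + (1 - xt) * np.log(1 - grid)
    succ += xt

in_set = (log_num - log_den) < np.log(1 / alpha)
print("final confidence set: [%.3f, %.3f]" % (grid[in_set].min(), grid[in_set].max()))
```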
arXiv Detail & Related papers (2023-11-08T00:10:21Z) - It's an Alignment, Not a Trade-off: Revisiting Bias and Variance in Deep
Models [51.66015254740692]
We show that for an ensemble of deep learning based classification models, bias and variance are aligned at a sample level.
We study this phenomenon from two theoretical perspectives: calibration and neural collapse.
arXiv Detail & Related papers (2023-10-13T17:06:34Z) - Towards Better Certified Segmentation via Diffusion Models [62.21617614504225]
Segmentation models can be vulnerable to adversarial perturbations, which hinders their use in critical-decision systems like healthcare or autonomous driving.
Recently, randomized smoothing has been proposed to certify segmentation predictions by adding Gaussian noise to the input to obtain theoretical guarantees.
In this paper, we address the problem of certifying segmentation prediction using a combination of randomized smoothing and diffusion models.
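A sketch of plain per-pixel randomized smoothing (majority vote over Gaussian-perturbed inputs), without the diffusion-model denoising step that the paper adds; the segmentation network below is a stand-in.

```python
# Sketch of per-pixel randomized smoothing for segmentation: majority-vote the
# predicted class over Gaussian-perturbed copies of the input.
import numpy as np

def dummy_segmenter(img):
    """Stand-in for a segmentation network: thresholds intensity into 2 classes."""
    return (img > 0.5).astype(int)

rng = np.random.default_rng(0)
img = rng.random((32, 32))            # toy single-channel image in [0, 1]
sigma, n_samples, n_classes = 0.25, 200, 2

votes = np.zeros((n_classes,) + img.shape)
for _ in range(n_samples):
    noisy = img + rng.normal(scale=sigma, size=img.shape)
    pred = dummy_segmenter(noisy)
    for c in range(n_classes):
        votes[c] += (pred == c)

smoothed = votes.argmax(axis=0)              # per-pixel majority class
top_frac = votes.max(axis=0) / n_samples     # confidence of the majority vote
print("foreground share after smoothing:", np.mean(smoothed == 1))
print("pixels with >90% vote agreement:", np.mean(top_frac > 0.9))
```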
arXiv Detail & Related papers (2023-06-16T16:30:39Z) - The Decaying Missing-at-Random Framework: Model Doubly Robust Causal Inference with Partially Labeled Data [8.916614661563893]
We introduce a decaying missing-at-random (decaying MAR) framework and associated approaches for doubly robust causal inference. This simultaneously addresses selection bias in the labeling mechanism and the extreme imbalance between labeled and unlabeled groups. To ensure robust causal conclusions, we propose a bias-reduced SS estimator for the average treatment effect.
arXiv Detail & Related papers (2023-05-22T07:37:12Z) - Arbitrariness and Social Prediction: The Confounding Role of Variance in
Fair Classification [31.392067805022414]
Variance in predictions across different trained models is a significant, under-explored source of error in fair binary classification.
In practice, the variance on some data examples is so large that decisions can be effectively arbitrary.
We develop an ensembling algorithm that abstains from classification when a prediction would be arbitrary.
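A sketch of the abstention idea: train a bootstrap ensemble and abstain whenever the vote is close to a coin flip. The threshold rule here is illustrative, not the paper's exact criterion.

```python
# Sketch: an ensemble that abstains whenever the prediction would be effectively
# arbitrary, i.e. the bootstrap vote is too close to 50/50.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = make_classification(n_samples=3000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

n_models, margin = 101, 0.2          # abstain if the vote share is within 0.5 +/- margin
votes = np.zeros(len(X_te))
for m in range(n_models):
    Xb, yb = resample(X_tr, y_tr, random_state=m)
    votes += DecisionTreeClassifier(random_state=m).fit(Xb, yb).predict(X_te)
share = votes / n_models

abstain = np.abs(share - 0.5) < margin
pred = (share >= 0.5).astype(int)
acc = np.mean(pred[~abstain] == y_te[~abstain])
print(f"abstention rate: {abstain.mean():.2%}, accuracy on the rest: {acc:.3f}")
```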
arXiv Detail & Related papers (2023-01-27T06:52:04Z) - D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling
Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies human-in-the-loop AI approach for auditing and mitigating social biases.
A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network.
For each interaction, say weakening/deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset.
arXiv Detail & Related papers (2022-08-10T03:41:48Z) - Evaluating Aleatoric Uncertainty via Conditional Generative Models [15.494774321257939]
We study conditional generative models for aleatoric uncertainty estimation.
We introduce two metrics to measure the discrepancy between two conditional distributions.
We demonstrate numerically how our metrics provide correct measurements of conditional distributional discrepancies.
arXiv Detail & Related papers (2022-06-09T05:39:04Z) - Fair Group-Shared Representations with Normalizing Flows [68.29997072804537]
We develop a fair representation learning algorithm which is able to map individuals belonging to different groups in a single group.
We show experimentally that our methodology is competitive with other fair representation learning algorithms.
arXiv Detail & Related papers (2022-01-17T10:49:49Z) - When in Doubt: Neural Non-Parametric Uncertainty Quantification for
Epidemic Forecasting [70.54920804222031]
Most existing forecasting models disregard uncertainty quantification, resulting in mis-calibrated predictions.
Recent works in deep neural models for uncertainty-aware time-series forecasting also have several limitations.
We model the forecasting task as a probabilistic generative process and propose a functional neural process model called EPIFNP.
arXiv Detail & Related papers (2021-06-07T18:31:47Z) - Model Mis-specification and Algorithmic Bias [0.0]
Machine learning algorithms are increasingly used to inform critical decisions.
There is a growing concern about bias, that algorithms may produce uneven outcomes for individuals in different demographic groups.
In this work, we measure bias as the difference between mean prediction errors across groups.
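The bias measure described above is straightforward to compute; a small sketch on synthetic data, assuming signed prediction errors:

```python
# Direct illustration of the bias measure above: the difference between mean
# prediction errors across demographic groups (toy regression data).
import numpy as np

rng = np.random.default_rng(0)
n = 4000
group = rng.integers(0, 2, size=n)
y_true = rng.normal(loc=group * 0.5, size=n)
y_pred = y_true + rng.normal(scale=0.3, size=n) + 0.2 * group  # mis-specified model

err = y_pred - y_true
bias_gap = err[group == 1].mean() - err[group == 0].mean()
print(f"mean error, group 0: {err[group == 0].mean():+.3f}")
print(f"mean error, group 1: {err[group == 1].mean():+.3f}")
print(f"bias (difference in mean errors): {bias_gap:+.3f}")
```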
arXiv Detail & Related papers (2021-05-31T17:45:12Z) - Counterfactual Invariance to Spurious Correlations: Why and How to Pass
Stress Tests [87.60900567941428]
A 'spurious correlation' is the dependence of a model on some aspect of the input data that an analyst thinks shouldn't matter.
In machine learning, these have a know-it-when-you-see-it character.
We study stress testing using the tools of causal inference.
arXiv Detail & Related papers (2021-05-31T14:39:38Z) - The statistical advantage of automatic NLG metrics at the system level [10.540821585237222]
Statistically, humans are unbiased, high variance estimators, while metrics are biased, low variance estimators.
We compare these estimators by their error in pairwise prediction (which generation system is better?) using the bootstrap.
Our analysis compares the adjusted error of metrics to humans and a derived, perfect segment-level annotator, both of which are unbiased estimators dependent on the number of judgments collected.
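A sketch of the bootstrap pairwise comparison: resample segments and check how often an estimator's system ranking flips; the paper's "adjusted error" correction is not included here.

```python
# Sketch: bootstrap over segments to see how often an estimator (metric scores
# or human judgments) prefers the worse of two generation systems.
import numpy as np

rng = np.random.default_rng(0)
n_segments, n_boot = 300, 2000
# Per-segment quality scores for two systems; system A is slightly better.
scores_a = rng.normal(loc=0.55, scale=0.2, size=n_segments)
scores_b = rng.normal(loc=0.50, scale=0.2, size=n_segments)

flips = 0
for _ in range(n_boot):
    idx = rng.integers(0, n_segments, size=n_segments)   # bootstrap over segments
    if scores_a[idx].mean() <= scores_b[idx].mean():
        flips += 1
print(f"fraction of bootstrap samples preferring the worse system: {flips / n_boot:.3f}")
```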
arXiv Detail & Related papers (2021-05-26T09:53:57Z) - Characterizing Fairness Over the Set of Good Models Under Selective
Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance.
We provide tractable algorithms to compute the range of attainable group-level predictive disparities.
We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z) - Unlabelled Data Improves Bayesian Uncertainty Calibration under
Covariate Shift [100.52588638477862]
We develop an approximate Bayesian inference scheme based on posterior regularisation.
We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.
arXiv Detail & Related papers (2020-06-26T13:50:19Z) - Individual Calibration with Randomized Forecasting [116.2086707626651]
We show that calibration for individual samples is possible in the regression setup if the predictions are randomized.
We design a training objective to enforce individual calibration and use it to train randomized regression functions.
arXiv Detail & Related papers (2020-06-18T05:53:10Z) - Is Your Classifier Actually Biased? Measuring Fairness under Uncertainty
with Bernstein Bounds [21.598196899084268]
We use Bernstein bounds to represent uncertainty about the bias estimate as a confidence interval.
We provide empirical evidence that a 95% confidence interval consistently bounds the true bias.
Our findings suggest that the datasets currently used to measure bias are too small to conclusively identify bias except in the most egregious cases.
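A sketch of the construction, using the Maurer-Pontil empirical Bernstein bound for [0, 1]-bounded variables and a union bound over the two groups; the paper's exact bound may differ.

```python
# Sketch: represent uncertainty in a bias estimate with Bernstein-style
# confidence intervals (empirical Bernstein bound, variables in [0, 1]).
import numpy as np

def empirical_bernstein_halfwidth(x, delta):
    """Half-width s.t. |mean(x) - E[x]| <= hw with prob >= 1 - delta, x in [0, 1]."""
    n = len(x)
    var = np.var(x, ddof=1)
    return np.sqrt(2 * var * np.log(2 / delta) / n) + 7 * np.log(2 / delta) / (3 * (n - 1))

rng = np.random.default_rng(0)
# Per-example "favourable outcome" indicators for two demographic groups.
g0 = (rng.random(400) < 0.62).astype(float)
g1 = (rng.random(350) < 0.55).astype(float)

delta = 0.05
bias_hat = g0.mean() - g1.mean()
# Union bound: delta/2 per group so the combined interval holds with prob >= 0.95.
hw = empirical_bernstein_halfwidth(g0, delta / 2) + empirical_bernstein_halfwidth(g1, delta / 2)
print(f"estimated bias: {bias_hat:+.3f}, 95% CI: [{bias_hat - hw:+.3f}, {bias_hat + hw:+.3f}]")
```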
arXiv Detail & Related papers (2020-04-26T09:45:45Z) - Machine learning for causal inference: on the use of cross-fit
estimators [77.34726150561087]
Doubly-robust cross-fit estimators have been proposed to yield better statistical properties.
We conducted a simulation study to assess the performance of several estimators for the average causal effect (ACE)
When used with machine learning, the doubly-robust cross-fit estimators substantially outperformed all of the other estimators in terms of bias, variance, and confidence interval coverage.
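A compact sketch of a 2-fold cross-fit AIPW (doubly-robust) estimator of the average causal effect, with random forests as nuisance models; this is illustrative and not the simulation study's exact setup.

```python
# Sketch: doubly-robust (AIPW) estimator of the average causal effect with
# 2-fold cross-fitting: nuisance models are fit on one fold, evaluated on the other.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 5))
propensity = 1 / (1 + np.exp(-X[:, 0]))
A = (rng.random(n) < propensity).astype(int)          # treatment
Y = 2.0 * A + X[:, 0] + rng.normal(size=n)            # true ACE = 2.0

psi = np.zeros(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    ps = RandomForestClassifier(random_state=0).fit(X[train], A[train])
    out1 = RandomForestRegressor(random_state=0).fit(X[train][A[train] == 1], Y[train][A[train] == 1])
    out0 = RandomForestRegressor(random_state=0).fit(X[train][A[train] == 0], Y[train][A[train] == 0])

    e = np.clip(ps.predict_proba(X[test])[:, 1], 0.01, 0.99)
    m1, m0 = out1.predict(X[test]), out0.predict(X[test])
    a, y = A[test], Y[test]
    # AIPW influence function: outcome-model difference plus IPW residual corrections.
    psi[test] = (m1 - m0
                 + a * (y - m1) / e
                 - (1 - a) * (y - m0) / (1 - e))

print(f"cross-fit AIPW estimate of the ACE: {psi.mean():.3f} (truth: 2.0)")
```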
arXiv Detail & Related papers (2020-04-21T23:09:55Z) - Recovering from Biased Data: Can Fairness Constraints Improve Accuracy? [11.435833538081557]
Empirical Risk Minimization (ERM) may produce a classifier that not only is biased but also has suboptimal accuracy on the true data distribution.
We examine the ability of fairness-constrained ERM to correct this problem.
We also consider other recovery methods including reweighting the training data, Equalized Odds, and Demographic Parity.
arXiv Detail & Related papers (2019-12-02T22:00:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.