Learning Probabilities of Causation from Finite Population Data
- URL: http://arxiv.org/abs/2505.17133v1
- Date: Thu, 22 May 2025 03:31:44 GMT
- Title: Learning Probabilities of Causation from Finite Population Data
- Authors: Shuai Wang, Song Jiang, Yizhou Sun, Judea Pearl, Ang Li
- Abstract summary: This paper addresses the challenge of predicting probabilities of causation for subpopulations with insufficient data. We propose using machine learning models that draw insights from subpopulations with sufficient data. Our evaluation of multiple machine learning models indicates that, given the population-level data and an appropriate choice of machine learning model and activation function, PNS can be effectively predicted.
- Score: 49.05791737581312
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Probabilities of causation play a crucial role in modern decision-making. This paper addresses the challenge of predicting probabilities of causation for subpopulations with insufficient data using machine learning models. Tian and Pearl first defined and derived tight bounds for three fundamental probabilities of causation: the probability of necessity and sufficiency (PNS), the probability of sufficiency (PS), and the probability of necessity (PN). However, estimating these probabilities requires both experimental and observational distributions specific to each subpopulation, which are often unavailable or impractical to obtain with limited population-level data. As a result, most subgroups lack enough data to estimate these probabilities accurately. Hence, to estimate these probabilities for subpopulations with insufficient data, we propose using machine learning models that draw insights from subpopulations with sufficient data. Our evaluation of multiple machine learning models indicates that, given the population-level data and an appropriate choice of machine learning model and activation function, PNS can be effectively predicted. Through simulation studies on multiple Structural Causal Models (SCMs), we show that our multilayer perceptron (MLP) model with the Mish activation function achieves a mean absolute error (MAE) of approximately $0.02$ in predicting PNS for $32,768$ subpopulations across most SCMs, using data from only $2,000$ subpopulations with known PNS values.
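For reference, the Tian-Pearl bounds named in the abstract combine the experimental quantities P(y_x) and P(y_{x'}) with the observational joint distribution of X and Y. The sketch below is a minimal Python rendering of those published bounds; the function name, argument names, and example numbers are illustrative, not the paper's.

```python
def pns_bounds(p_yx, p_yxp, p_xy, p_xyp, p_xpy, p_xpyp):
    """Tight Tian-Pearl bounds on PNS = P(y_x, y'_{x'}).

    Experimental inputs:
        p_yx  = P(y_x)    : P(Y=1 | do(X=1))
        p_yxp = P(y_{x'}) : P(Y=1 | do(X=0))
    Observational joint (the four cells must sum to 1):
        p_xy, p_xyp, p_xpy, p_xpyp = P(x,y), P(x,y'), P(x',y), P(x',y')
    """
    p_y = p_xy + p_xpy  # observational P(Y=1)
    lower = max(0.0, p_yx - p_yxp, p_y - p_yxp, p_yx - p_y)
    upper = min(p_yx,
                1.0 - p_yxp,                   # = P(y'_{x'})
                p_xy + p_xpyp,
                p_yx - p_yxp + p_xyp + p_xpy)
    return lower, upper

# Hypothetical subpopulation: the bounds bracket the unidentifiable PNS.
lo, hi = pns_bounds(p_yx=0.7, p_yxp=0.3,
                    p_xy=0.30, p_xyp=0.20, p_xpy=0.10, p_xpyp=0.40)
print(f"PNS in [{lo:.2f}, {hi:.2f}]")  # PNS in [0.40, 0.70]
```

Computing these bounds requires both distributions for each subpopulation, which is exactly what is missing for data-poor subgroups; the paper's models instead predict PNS from the subpopulations where it is known.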
Related papers
- Estimating Probabilities of Causation with Machine Learning Models [13.50260067414662]
This paper addresses the challenge of predicting probabilities of causation for subpopulations with insufficient data. We use machine learning models that draw insights from subpopulations with sufficient data. We show that our multilayer perceptron (MLP) model with the Mish activation function achieves a mean absolute error (MAE) of approximately 0.02 in predicting the probability of necessity and sufficiency (PNS) for 32,768 subpopulations.
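This paper and the main abstract above report the same MLP-with-Mish result. Below is a minimal PyTorch sketch of such a regressor; the layer sizes, sigmoid output head, training loop, and placeholder data are my assumptions, not the authors' exact setup.

```python
import torch
import torch.nn as nn

class PNSRegressor(nn.Module):
    """Illustrative MLP mapping subpopulation features to a PNS estimate in [0, 1]."""

    def __init__(self, n_features: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.Mish(),                # Mish activation, as reported in the abstracts
            nn.Linear(hidden, hidden),
            nn.Mish(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),             # PNS is a probability
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

# 15 binary covariates give 2**15 = 32,768 subpopulations; train on the
# ~2,000 with known PNS and measure MAE (the reported metric) on the rest.
model = PNSRegressor(n_features=15)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()                        # mean absolute error
x = torch.randint(0, 2, (2000, 15)).float()  # placeholder features
y = torch.rand(2000)                         # placeholder PNS targets
for _ in range(100):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```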
arXiv Detail & Related papers (2025-02-13T00:18:08Z)
- Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
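As a toy illustration of that idea, one can estimate a per-class feature distribution and draw virtual features from it to bolster rare classes; the diagonal Gaussian below is a simplification for the sketch (the ProCo paper itself uses a different parametric family on normalized features and a closed-form expected loss).

```python
import numpy as np

def fit_class_gaussians(features, labels):
    """Estimate a diagonal Gaussian per class in feature space."""
    return {c: (features[labels == c].mean(axis=0),
                features[labels == c].std(axis=0) + 1e-6)
            for c in np.unique(labels)}

def sample_virtual_features(stats, cls, n, rng=None):
    """Draw virtual feature vectors for a (possibly rare) class."""
    rng = rng or np.random.default_rng()
    mu, sigma = stats[cls]
    return rng.normal(mu, sigma, size=(n, mu.shape[0]))

# Rare classes then contribute sampled positives to the contrastive loss,
# compensating for their small sample counts.
```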
arXiv Detail & Related papers (2024-03-11T13:44:49Z)
- R-divergence for Estimating Model-oriented Distribution Discrepancy [37.939239477868796]
We introduce R-divergence, designed to assess model-oriented distribution discrepancies.
R-divergence learns a minimum hypothesis on the mixed data and then gauges the empirical risk difference between them.
We evaluate the test power across various unsupervised and supervised tasks and find that R-divergence achieves state-of-the-art performance.
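A rough sketch of that recipe, with a placeholder model class and loss (the paper's formulation covers unsupervised tasks as well; the supervised regression version below is only illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def r_divergence(X1, y1, X2, y2, model=None):
    """Fit a minimum-empirical-risk hypothesis on the mixed data, then
    gauge the empirical risk difference between the two samples."""
    model = model if model is not None else Ridge()
    model.fit(np.vstack([X1, X2]), np.concatenate([y1, y2]))
    r1 = mean_squared_error(y1, model.predict(X1))
    r2 = mean_squared_error(y2, model.predict(X2))
    return abs(r1 - r2)  # ~0 when the samples look alike to this model class
```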
arXiv Detail & Related papers (2023-10-02T11:30:49Z)
- Confidence-Based Model Selection: When to Take Shortcuts for Subpopulation Shifts [119.22672589020394]
We propose COnfidence-baSed MOdel Selection (CosMoS), where model confidence can effectively guide model selection.
We evaluate CosMoS on four datasets with spurious correlations, each with multiple test sets with varying levels of data distribution shift.
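A schematic of confidence-guided selection as described: score each candidate model by its average predictive confidence on the unlabeled shifted test set and deploy the most confident one. The max-softmax score below is my stand-in, not necessarily CosMoS's exact rule.

```python
import torch

@torch.no_grad()
def select_by_confidence(models, x_test):
    """Return the candidate whose mean max-softmax confidence on the
    unlabeled test set is highest (sketch of confidence-based selection)."""
    def avg_confidence(model):
        probs = torch.softmax(model(x_test), dim=-1)
        return probs.max(dim=-1).values.mean().item()
    return max(models, key=avg_confidence)
```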
arXiv Detail & Related papers (2023-06-19T18:48:15Z)
- An Epistemic and Aleatoric Decomposition of Arbitrariness to Constrain the Set of Good Models [7.620967781722717]
Recent research reveals that machine learning (ML) models are highly sensitive to minor changes in their training procedure. We show that stability decomposes into epistemic and aleatoric components, capturing the consistency and confidence in prediction. We propose a model selection procedure that includes epistemic and aleatoric criteria alongside existing accuracy and fairness criteria, and show that it successfully narrows down a large set of good models.
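One concrete reading of that decomposition, with estimators that are my own simplification: retrain the model under perturbed training runs, then measure cross-run agreement (consistency, the epistemic part) and mean predictive confidence (the aleatoric part).

```python
import numpy as np

def stability_decomposition(pred_probs):
    """pred_probs: (runs, samples, classes) predicted probabilities from
    models trained with different seeds/bootstraps (illustrative only)."""
    votes = pred_probs.argmax(axis=-1)                              # (runs, samples)
    modal = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    consistency = (votes == modal).mean()        # epistemic: agreement across runs
    confidence = pred_probs.max(axis=-1).mean()  # aleatoric: mean confidence
    return consistency, confidence
```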
arXiv Detail & Related papers (2023-02-09T09:35:36Z)
- Neural Spline Search for Quantile Probabilistic Modeling [35.914279831992964]
We propose a non-parametric and data-driven approach, Neural Spline Search (NSS), to represent the observed data distribution without parametric assumptions.
We demonstrate that NSS outperforms previous methods on synthetic, real-world regression and time-series forecasting tasks.
arXiv Detail & Related papers (2023-01-12T07:45:28Z)
- Learning Probabilities of Causation from Finite Population Data [40.99426447422972]
We propose a machine learning model that helps to learn the bounds of the probabilities of causation for subpopulations given finite population data.
In a simulation study, we show that the machine learning model is able to learn the bounds of PNS for 32,768 subpopulations while knowing the values of only roughly 500 of them from the finite population data.
arXiv Detail & Related papers (2022-10-16T05:46:25Z)
- Learning Multivariate CDFs and Copulas using Tensor Factorization [39.24470798045442]
Learning the multivariate distribution of data is a core challenge in statistics and machine learning.
In this work, we aim to learn multivariate cumulative distribution functions (CDFs), as they can handle mixed random variables.
We show that any grid sampled version of a joint CDF of mixed random variables admits a universal representation as a naive Bayes model.
We demonstrate the superior performance of the proposed model in several synthetic and real datasets and applications including regression, sampling and data imputation.
arXiv Detail & Related papers (2022-10-13T16:18:46Z)
- Evaluating Distributional Distortion in Neural Language Modeling [81.83408583979745]
A heavy tail of rare events accounts for a significant amount of the total probability mass of distributions in language.
Standard language modeling metrics such as perplexity quantify the performance of language models (LMs) in aggregate.
We develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages.
arXiv Detail & Related papers (2022-03-24T01:09:46Z)
- Probabilistic Gradient Boosting Machines for Large-Scale Probabilistic Regression [51.770998056563094]
Probabilistic Gradient Boosting Machines (PGBM) is a method to create probabilistic predictions with a single ensemble of decision trees.
We empirically demonstrate the advantages of PGBM compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2021-06-03T08:32:13Z)
- General stochastic separation theorems with optimal bounds [68.8204255655161]
The phenomenon of stochastic separability was revealed and used in machine learning to correct errors of Artificial Intelligence (AI) systems and to analyze AI instabilities.
Errors or clusters of errors can be separated from the rest of the data.
The ability to correct an AI system also opens up the possibility of an attack on it, and the high dimensionality induces vulnerabilities caused by the same separability.
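These theorems rest on the fact that, in high dimension, a single point is with high probability separable from the rest of the data by one linear inequality. A small numpy sketch of a Fisher-separability check (the threshold alpha and the whitening assumption come from that line of work; the toy data are mine):

```python
import numpy as np

def fisher_separable(point, data, alpha=0.8):
    """True if <x, p> <= alpha * <p, p> for every row x of `data`,
    i.e. one linear functional isolates `point` (e.g. a single AI error).
    Assumes the data have been centered/whitened."""
    return bool(np.all(data @ point <= alpha * (point @ point)))

# Toy check: in dimension 200, a random point is typically separable
# from 1,000 random samples.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 200))
p = rng.standard_normal(200)
print(fisher_separable(p, X))  # almost surely True at this dimension
```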
arXiv Detail & Related papers (2020-10-11T13:12:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.