Bias Detection via Maximum Subgroup Discrepancy
- URL: http://arxiv.org/abs/2502.02221v2
- Date: Wed, 11 Jun 2025 08:50:00 GMT
- Title: Bias Detection via Maximum Subgroup Discrepancy
- Authors: Jiří Němeček, Mark Kozdoba, Illia Kryvoviaz, Tomáš Pevný, Jakub Mareček
- Abstract summary: We propose a new notion of distance, the Maximum Subgroup Discrepancy (MSD). In this metric, two distributions are close if, roughly, discrepancies are low for all feature subgroups. We show that the sample complexity is linear in the number of features, thus making it feasible for practical applications.
- Score: 2.236957801565796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bias evaluation is fundamental to trustworthy AI, both in terms of checking data quality and in terms of checking the outputs of AI systems. In testing data quality, for example, one may study the distance of a given dataset, viewed as a distribution, to a given ground-truth reference dataset. However, classical metrics, such as the Total Variation and the Wasserstein distances, are known to have high sample complexities and, therefore, may fail to provide a meaningful distinction in many practical scenarios. In this paper, we propose a new notion of distance, the Maximum Subgroup Discrepancy (MSD). In this metric, two distributions are close if, roughly, discrepancies are low for all feature subgroups. While the number of subgroups may be exponential, we show that the sample complexity is linear in the number of features, thus making it feasible for practical applications. Moreover, we provide a practical algorithm for evaluating the distance based on Mixed-integer optimization (MIO). We also note that the proposed distance is easily interpretable, thus providing clearer paths to fixing the biases once they have been identified. Finally, we describe a natural general bias detection framework, termed MSDD distances, and show that MSD aligns well with this framework. We empirically evaluate MSD by comparing it with other metrics and by demonstrating the above properties of MSD on real-world datasets.
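To make the definition concrete, here is a minimal brute-force sketch of the MSD idea over binary protected features: the distance is the largest gap in probability mass across all subgroups, where a subgroup is a conjunction of feature values. The enumeration below is exponential in the number of features, which is precisely what the paper's MIO formulation avoids; the function names and the binary encoding are our own assumptions, not the authors' code.

```python
# Brute-force MSD sketch: max over all subgroups (conjunctions of binary
# protected-feature values) of the gap in subgroup probability mass.
from itertools import combinations, product
import numpy as np

def msd_brute_force(X, Y):
    """X, Y: (n, d) arrays of binary protected features; returns (msd, subgroup)."""
    d = X.shape[1]
    best, best_sub = 0.0, {}
    for k in range(1, d + 1):
        for feats in combinations(range(d), k):
            for vals in product([0, 1], repeat=k):
                in_x = np.all(X[:, list(feats)] == vals, axis=1).mean()
                in_y = np.all(Y[:, list(feats)] == vals, axis=1).mean()
                if abs(in_x - in_y) > best:
                    best, best_sub = abs(in_x - in_y), dict(zip(feats, vals))
    return best, best_sub

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 4))
Y = rng.integers(0, 2, size=(1000, 4))
Y[:, 0] &= Y[:, 1]  # plant a subgroup-level discrepancy in Y
print(msd_brute_force(X, Y))
```

On such toy data, the returned subgroup pins down exactly where the two samples disagree, which is the interpretability property the abstract highlights.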
Related papers
- Optimal Transport with Heterogeneously Missing Data [12.896319628045967]
We consider the problem of solving the optimal transport problem between two empirical distributions with missing values. We show that the Wasserstein distance between empirical distributions and linear Monge maps can be debiased without significantly affecting the sample complexity.
arXiv Detail & Related papers (2025-05-22T21:16:22Z)
- Sublinear Algorithms for Wasserstein and Total Variation Distances: Applications to Fairness and Privacy Auditing [7.81603404636933]
We propose a generic algorithmic framework to estimate the PDF and CDF of any sub-Gaussian distribution while the samples from it arrive in a stream.
We compute mergeable summaries of distributions from the stream of samples that require sublinear space w.r.t. the number of observed samples.
This allows us to estimate Wasserstein and Total Variation (TV) distances between any two sub-Gaussian distributions while samples arrive in streams and from multiple sources.
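As a non-streaming point of reference for the two distances estimated above, the following sketch computes a 1-D Wasserstein distance with SciPy and a histogram plug-in estimate of Total Variation; it is a batch baseline, not the paper's sublinear-space algorithm.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 10_000)
b = rng.normal(0.3, 1.0, 10_000)

w1 = wasserstein_distance(a, b)  # ~0.3, the mean shift

# TV plug-in: histogram both samples on shared bins, then 0.5 * L1 distance.
bins = np.histogram_bin_edges(np.concatenate([a, b]), bins=64)
p, _ = np.histogram(a, bins=bins)
q, _ = np.histogram(b, bins=bins)
tv = 0.5 * np.abs(p / p.sum() - q / q.sum()).sum()
print(f"W1 = {w1:.3f}, TV = {tv:.3f}")
```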
arXiv Detail & Related papers (2025-03-10T18:57:48Z)
- Sample Complexity of Bias Detection with Subsampled Point-to-Subspace Distances [2.7624021966289596]
The sample complexity of bias estimation is a lower bound on the runtime of any bias detection method.
We reformulate bias detection as a point-to-subspace problem on the space of measures and show that, for supremum norm, it can be subsampled efficiently.
arXiv Detail & Related papers (2025-02-04T14:03:49Z)
- Symmetry Discovery for Different Data Types [52.2614860099811]
Equivariant neural networks incorporate symmetries into their architecture, achieving higher generalization performance.
We propose LieSD, a method for discovering symmetries via trained neural networks which approximate the input-output mappings of the tasks.
We validate the performance of LieSD on tasks with symmetries such as the two-body problem, the moment of inertia matrix prediction, and top quark tagging.
arXiv Detail & Related papers (2024-10-13T13:39:39Z)
- Improving Distribution Alignment with Diversity-based Sampling [0.0]
Domain shifts are ubiquitous in machine learning, and can substantially degrade a model's performance when deployed to real-world data.
This paper proposes to improve the minibatch-based estimates of distribution discrepancy used for alignment by inducing diversity in each sampled minibatch.
It simultaneously balances the data and reduces the variance of the gradients, thereby enhancing the model's generalisation ability.
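One simple way to induce such diversity, sketched here under our own assumptions (the paper's sampling scheme may differ), is to cluster the data once and then draw each minibatch evenly across clusters, so batches cover the input space rather than mirroring its density.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 16))
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

def diverse_batch(labels, batch_size, rng):
    # Sample an equal number of points from every cluster.
    n_clusters = labels.max() + 1
    per = batch_size // n_clusters
    idx = [rng.choice(np.where(labels == c)[0], size=per, replace=False)
           for c in range(n_clusters)]
    return np.concatenate(idx)

batch = diverse_batch(labels, 64, rng)  # indices of one balanced minibatch
```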
arXiv Detail & Related papers (2024-10-05T17:26:03Z)
- A structured regression approach for evaluating model performance across intersectional subgroups [53.91682617836498]
Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups.
We introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups.
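The following toy sketch illustrates the general idea, not necessarily the authors' exact model: fit a regression from subgroup attributes to per-example correctness, so small subgroups borrow strength from the fitted structure instead of relying on tiny raw means.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(5000, 3))          # 3 binary attributes -> 8 subgroups
p_true = 0.9 - 0.15 * A[:, 0] - 0.10 * A[:, 1]  # true per-subgroup accuracy
correct = rng.random(5000) < p_true             # per-example correctness

model = LogisticRegression().fit(A, correct)    # structure: additive attribute effects
cells = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)])
print(model.predict_proba(cells)[:, 1])         # smoothed accuracy per subgroup
```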
arXiv Detail & Related papers (2024-01-26T14:21:45Z)
- Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data; then a minimal number of available labeled data points are assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
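A rough sketch of that recipe, using the third-party minisom package (an assumption on our part; the paper's BMU selection and propagation details are not reproduced here):

```python
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))             # unlabeled pool
X_lab, y_lab = X[:20], (X[:20, 0] > 0).astype(int)  # tiny labeled subset, toy labels

som = MiniSom(10, 10, 8, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 5000)                  # unsupervised phase

unit_label = {}                            # label the BMUs of the labeled points
for x, y in zip(X_lab, y_lab):
    unit_label[som.winner(x)] = y

def predict(x):
    # Predict via the closest labeled unit on the map grid.
    w = som.winner(x)
    return min(unit_label.items(),
               key=lambda kv: (kv[0][0] - w[0])**2 + (kv[0][1] - w[1])**2)[1]

print(predict(X[100]))
```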
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- Anomaly Detection Under Uncertainty Using Distributionally Robust Optimization Approach [0.9217021281095907]
Anomaly detection is defined as the problem of finding data points that do not follow the patterns of the majority.
The one-class Support Vector Machines (SVM) method aims to find a decision boundary to distinguish between normal data points and anomalies.
A distributionally robust chance-constrained model is proposed in which the probability of misclassification is low.
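For reference, the standard one-class SVM baseline that the proposed distributionally robust model builds on is available in scikit-learn; the chance-constrained variant itself is a custom optimization model and is not shown here.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(500, 2))              # "normal" data only
X_test = np.vstack([rng.normal(0, 1, size=(20, 2)),
                    rng.normal(6, 1, size=(5, 2))])    # 5 planted anomalies

clf = OneClassSVM(kernel="rbf", nu=0.05).fit(X_train)  # nu bounds outlier fraction
print(clf.predict(X_test))                             # +1 normal, -1 anomaly
```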
arXiv Detail & Related papers (2023-12-03T06:13:22Z)
- Computing the Distance between unbalanced Distributions -- The flat Metric [0.0]
We provide an implementation to compute the flat metric in any dimension. Our implementation adapts very well to mass differences and uses them to distinguish between different distributions.
arXiv Detail & Related papers (2023-08-02T09:30:22Z)
- Approximating a RUM from Distributions on k-Slates [88.32814292632675]
We give an algorithm that finds the RUM that best approximates the given distribution on average.
Our theoretical result can also be made practical: we obtain a method that is effective and scales to real-world datasets.
arXiv Detail & Related papers (2023-05-22T17:43:34Z)
- Conditional expectation with regularization for missing data imputation [19.254291863337347]
Missing data frequently occurs in datasets across various domains, such as medicine, sports, and finance.
We propose a new algorithm named "conditional Distribution-based Imputation of Missing Values with Regularization" (DIMV).
DIMV operates by determining the conditional distribution of a feature that has missing entries, using the information from the fully observed features as a basis.
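A minimal sketch of that conditional-expectation idea under a Gaussian model with ridge-style regularization (DIMV's exact estimator and regularizer may differ):

```python
import numpy as np

def impute_conditional_mean(x, mu, Sigma, miss, reg=1e-2):
    """x: sample with np.nan at `miss` indices; mu, Sigma: fitted moments."""
    obs = [i for i in range(len(x)) if i not in miss]
    S_oo = Sigma[np.ix_(obs, obs)] + reg * np.eye(len(obs))  # regularized
    S_mo = Sigma[np.ix_(miss, obs)]
    x = x.copy()
    # Conditional mean: mu_m + S_mo S_oo^{-1} (x_o - mu_o)
    x[miss] = mu[miss] + S_mo @ np.linalg.solve(S_oo, x[obs] - mu[obs])
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
data = rng.normal(size=(2000, 3)) @ A.T        # correlated features
mu, Sigma = data.mean(0), np.cov(data, rowvar=False)

x = data[0].copy()
x[2] = np.nan
print(impute_conditional_mean(x, mu, Sigma, miss=[2]))
```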
arXiv Detail & Related papers (2023-02-02T06:59:15Z)
- Fixed and adaptive landmark sets for finite pseudometric spaces [0.9137554315375919]
"Lastfirst" based on ranked distances implies a cover comprising sets of uniform cardinality.
We perform benchmark tests and compare its performance to that of maxmin on feature detection and class prediction tasks.
We find that lastfirst achieves comparable performance on prediction tasks and outperforms maxmin on homology detection tasks.
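The maxmin baseline referenced above is a short greedy procedure: repeatedly add the point farthest from the current landmark set. A sketch follows (lastfirst itself works with ranked distances and is not reproduced here):

```python
import numpy as np

def maxmin_landmarks(D, k, seed=0):
    """D: (n, n) pairwise distance matrix; returns k landmark indices."""
    landmarks = [seed]
    dist = D[seed].copy()                  # distance to nearest landmark
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))         # farthest point from all landmarks
        landmarks.append(nxt)
        dist = np.minimum(dist, D[nxt])
    return landmarks

rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 2))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(maxmin_landmarks(D, 5))
```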
arXiv Detail & Related papers (2022-12-19T19:53:33Z)
- Meta Learning Low Rank Covariance Factors for Energy-Based Deterministic Uncertainty [58.144520501201995]
Bi-Lipschitz regularization of neural network layers preserves relative distances between data instances in the feature spaces of each layer.
With the use of an attentive set encoder, we propose to meta learn either diagonal or diagonal plus low-rank factors to efficiently construct task specific covariance matrices.
We also propose an inference procedure which utilizes scaled energy to achieve a final predictive distribution.
arXiv Detail & Related papers (2021-10-12T22:04:19Z)
- Causal Order Identification to Address Confounding: Binary Variables [4.56877715768796]
This paper considers an extension of the linear non-Gaussian acyclic model (LiNGAM).
LiNGAM determines the causal order among variables from a dataset when the variables are expressed by a set of linear equations, including noise.
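A quick illustration of the base (continuous) LiNGAM model the paper extends, using the third-party lingam package; we treat this API as an assumption, and the paper's binary-variable extension is not shown.

```python
import numpy as np
import lingam  # third-party package: pip install lingam

rng = np.random.default_rng(0)
x0 = rng.uniform(size=2000)               # non-Gaussian exogenous noise
x1 = 1.5 * x0 + rng.uniform(size=2000)    # x0 -> x1
x2 = 0.8 * x1 + rng.uniform(size=2000)    # x1 -> x2
X = np.column_stack([x2, x0, x1])         # shuffled column order

model = lingam.DirectLiNGAM()
model.fit(X)
print(model.causal_order_)                # expected: [1, 2, 0]
```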
arXiv Detail & Related papers (2021-08-10T22:09:43Z)
- A Unified Joint Maximum Mean Discrepancy for Domain Adaptation [73.44809425486767]
This paper theoretically derives a unified form of JMMD that is easy to optimize.
From the revealed unified JMMD, we illustrate that JMMD degrades the feature-label dependence that benefits classification.
We propose a novel MMD matrix to promote the dependence, and devise a novel label kernel that is robust to label distribution shift.
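For orientation, a standard biased RBF-kernel estimator of squared MMD is a few lines of NumPy; JMMD extends this to joint feature-label kernels, which this sketch does not implement.

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    """Biased estimator of squared MMD with k(a, b) = exp(-gamma * ||a - b||^2)."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 4))
Y = rng.normal(0.5, 1.0, size=(300, 4))
print(mmd2_rbf(X, Y))   # clearly > 0 for shifted distributions
```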
arXiv Detail & Related papers (2021-01-25T09:46:14Z)
- Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
- The Gap on GAP: Tackling the Problem of Differing Data Distributions in Bias-Measuring Datasets [58.53269361115974]
Diagnostic datasets that can detect biased models are an important prerequisite for bias reduction within natural language processing.
However, undesired patterns in the collected data can make such tests incorrect.
We introduce a theoretically grounded method for weighting test samples to cope with such patterns in the test data.
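A generic sketch of the reweighting idea, assuming a uniform target over one binary attribute (the paper's weighting is theoretically grounded for GAP's specific setup and more refined):

```python
import numpy as np

rng = np.random.default_rng(0)
attr = (rng.random(200) < 0.25).astype(int)                 # imbalanced test attribute
correct = rng.random(200) < np.where(attr == 1, 0.6, 0.9)   # accuracy differs by group

counts = np.bincount(attr, minlength=2)
weights = (len(attr) / (2.0 * counts))[attr]                # inverse-frequency weights
print("raw accuracy:     ", correct.mean())
print("weighted accuracy:", np.average(correct, weights=weights))
```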
arXiv Detail & Related papers (2020-11-03T16:50:13Z)
- Learning to Match Distributions for Domain Adaptation [116.14838935146004]
This paper proposes Learning to Match (L2M) to automatically learn the cross-domain distribution matching.
L2M reduces the inductive bias by using a meta-network to learn the distribution matching loss in a data-driven way.
Experiments on public datasets substantiate the superiority of L2M over SOTA methods.
arXiv Detail & Related papers (2020-07-17T03:26:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.