Representation Invariance and Allocation: When Subgroup Balance Matters
- URL: http://arxiv.org/abs/2512.09496v1
- Date: Wed, 10 Dec 2025 10:19:48 GMT
- Title: Representation Invariance and Allocation: When Subgroup Balance Matters
- Authors: Anissa Alloula, Charles Jones, Zuzanna Wakefield-Skorniewska, Francesco Quinzan, Bartłomiej Papież
- Abstract summary: In some cases, imbalanced data distributions actually improve subgroup performance, while in others, subgroup performance remains unaffected by the absence of an entire subgroup during training. We propose the latent separation hypothesis, which states that a partially fine-tuned model's dependence on subgroup representation is determined by the degree of separation between subgroups in the latent space of the pre-trained model.
- Score: 2.910375306412165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unequal representation of demographic groups in training data poses challenges to model generalisation across populations. Standard practice assumes that balancing subgroup representation optimises performance. However, recent empirical results contradict this assumption: in some cases, imbalanced data distributions actually improve subgroup performance, while in others, subgroup performance remains unaffected by the absence of an entire subgroup during training. We conduct a systematic study of subgroup allocation across four vision and language models, varying training data composition to characterise the sensitivity of subgroup performance to data balance. We propose the latent separation hypothesis, which states that a partially fine-tuned model's dependence on subgroup representation is determined by the degree of separation between subgroups in the latent space of the pre-trained model. We formalise this hypothesis, provide theoretical analysis, and validate it empirically. Finally, we present a practical application to foundation model fine-tuning, demonstrating that quantitative analysis of latent subgroup separation can inform data collection and balancing decisions.
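The paper's practical takeaway invites a simple diagnostic before fine-tuning: quantify how separable the subgroups are in the frozen encoder's latent space. A minimal sketch, assuming subgroup-labelled embeddings have already been extracted from the pre-trained encoder and using cross-validated linear-probe accuracy as one illustrative separation measure (the paper's exact metric may differ):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def subgroup_separation_score(embeddings: np.ndarray, subgroups: np.ndarray) -> float:
    """Estimate subgroup separation in a frozen encoder's latent space.

    Returns the cross-validated accuracy of a linear probe predicting
    subgroup membership from embeddings: near-chance accuracy means the
    subgroups are entangled, high accuracy means they are well separated.
    """
    probe = LogisticRegression(max_iter=1000)
    return float(cross_val_score(probe, embeddings, subgroups, cv=5).mean())

# Hypothetical usage with synthetic latent vectors.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 64))          # latent vectors from a frozen encoder
g = rng.integers(0, 2, size=200)        # subgroup labels
z[g == 1] += 1.5                        # inject separation for illustration
print(subgroup_separation_score(z, g))  # ~0.5 = entangled, ~1.0 = separated
```
Under the latent separation hypothesis, a high score suggests downstream subgroup performance will be sensitive to training-data balance, making balanced collection worthwhile; a near-chance score suggests balance matters less.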
Related papers
- GUDA: Counterfactual Group-wise Training Data Attribution for Diffusion Models via Unlearning [83.56510119503267]
Group-wise attribution is counterfactual: how would a model's behavior on a generated sample change if a group were absent from training? We propose GUDA (Group Unlearning-based Data Attribution) for diffusion models, which approximates each counterfactual model by applying machine unlearning to a shared full-data model instead of training from scratch.
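The attribution step itself is compact. A minimal sketch, with `unlearn_group` and `model_behavior` as hypothetical stand-ins for the paper's unlearning procedure and behaviour metric:
```python
def group_attribution(full_model, groups, train_data, sample,
                      unlearn_group, model_behavior):
    """Approximate counterfactual group-wise attribution.

    For each group, approximate the model that never saw that group by
    unlearning it from the shared full-data model (rather than retraining
    from scratch), then measure the change in behaviour on `sample`.
    """
    baseline = model_behavior(full_model, sample)
    scores = {}
    for group in groups:
        counterfactual = unlearn_group(full_model, train_data, group)
        scores[group] = baseline - model_behavior(counterfactual, sample)
    return scores
```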
arXiv Detail & Related papers (2026-01-30T07:10:59Z) - Understanding challenges to the interpretation of disaggregated evaluations of algorithmic fairness [49.35494016290887]
We show that equal performance across subgroups is an unreliable measure of fairness when data are representative of relevant populations but reflective of real-world disparities. Our framework suggests complementing disaggregated evaluations with explicit causal assumptions and analysis to control for confounding and distribution shift.
arXiv Detail & Related papers (2025-06-04T17:40:31Z) - An active learning framework for multi-group mean estimation [11.799152724436999]
We study a fundamental learning problem over multiple groups with unknown data distributions. We propose an algorithm, Variance-UCB, that sequentially selects groups according to an upper confidence bound on the variance estimate.
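A minimal sketch of such a selection rule, using a generic Hoeffding-style exploration bonus rather than the paper's exact confidence bound:
```python
import numpy as np

def variance_ucb(sample_streams, budget):
    """Sequentially allocate a sampling budget across groups using an
    upper confidence bound on each group's variance estimate."""
    k = len(sample_streams)
    draws = [[stream() for _ in range(2)] for stream in sample_streams]  # initialise
    for t in range(2 * k, budget):
        # UCB: empirical variance plus an exploration bonus that shrinks
        # as a group is sampled more often.
        ucb = [np.var(d, ddof=1) + np.sqrt(2 * np.log(t) / len(d)) for d in draws]
        i = int(np.argmax(ucb))
        draws[i].append(sample_streams[i]())
    return [float(np.mean(d)) for d in draws]  # per-group mean estimates

# Hypothetical usage: the noisier group should receive more samples.
rng = np.random.default_rng(1)
streams = [lambda: rng.normal(0.0, 1.0), lambda: rng.normal(1.0, 5.0)]
print(variance_ucb(streams, budget=200))
```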
arXiv Detail & Related papers (2025-05-20T20:13:04Z) - A Contrastive Learning Approach to Mitigate Bias in Speech Models [13.192011475857234]
We employ a three-level learning technique that guides the model to focus on different scopes for the contrastive loss.
Experiments on two spoken language understanding datasets and two languages demonstrate that our approach improves internal subgroup representations.
arXiv Detail & Related papers (2024-06-20T19:20:00Z) - A structured regression approach for evaluating model performance across intersectional subgroups [53.91682617836498]
Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups.
We introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups.
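A minimal sketch of the pooling idea, using a logistic regression with main-effect terms only so that estimates for tiny intersectional cells borrow strength from larger marginal groups (the paper's structured model is more elaborate):
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical evaluation data: per-example correctness plus two attributes.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sex": rng.choice(["F", "M"], 500),
    "age": rng.choice(["young", "old"], 500),
})
df["correct"] = (rng.random(500) < 0.8 - 0.1 * (df["sex"] == "F")).astype(int)

# Main-effects-only design: each intersectional estimate pools information
# across all rows sharing each attribute level, stabilising small cells.
enc = OneHotEncoder()
X = enc.fit_transform(df[["sex", "age"]])
model = LogisticRegression().fit(X, df["correct"])

# Smoothed performance estimate for every intersection, even tiny ones.
cells = pd.DataFrame([(s, a) for s in ["F", "M"] for a in ["young", "old"]],
                     columns=["sex", "age"])
cells["est_accuracy"] = model.predict_proba(enc.transform(cells))[:, 1]
print(cells)
```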
arXiv Detail & Related papers (2024-01-26T14:21:45Z) - The Role of Subgroup Separability in Group-Fair Medical Image Classification [18.29079361470428]
We find a relationship between subgroup separability, subgroup disparities, and performance degradation when models are trained on data with systematic bias such as underdiagnosis.
Our findings shed new light on the question of how models become biased, providing important insights for the development of fair medical imaging AI.
arXiv Detail & Related papers (2023-07-06T06:06:47Z) - Leveraging Structure for Improved Classification of Grouped Biased Data [8.121462458089143]
We consider semi-supervised binary classification for applications in which data points are naturally grouped.
We derive a semi-supervised algorithm that explicitly leverages the structure to learn an optimal, group-aware, probability-output classifier.
arXiv Detail & Related papers (2022-12-07T15:18:21Z) - Addressing Missing Sources with Adversarial Support-Matching [8.53946780558779]
We investigate a scenario in which the absence of certain data is linked to the second level of a two-level hierarchy in the data.
Inspired by the idea of protected groups from algorithmic fairness, we refer to the partitions carved by this second level as "subgroups".
We make use of an additional, diverse but unlabeled dataset, called the "deployment set", to learn a representation that is invariant to subgroup.
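A minimal sketch of subgroup-invariant representation learning, using a generic adversarial subgroup discriminator with gradient reversal rather than the paper's support-matching objective:
```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign so the
    encoder is trained to *fool* the subgroup discriminator."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad

encoder = nn.Sequential(nn.Linear(32, 16), nn.ReLU())
task_head = nn.Linear(16, 2)   # predicts the class label
adv_head = nn.Linear(16, 2)    # predicts the subgroup label
opt = torch.optim.Adam([*encoder.parameters(), *task_head.parameters(),
                        *adv_head.parameters()], lr=1e-3)
ce = nn.CrossEntropyLoss()

# Hypothetical batch: features x, class y, subgroup g.
x = torch.randn(64, 32)
y = torch.randint(0, 2, (64,))
g = torch.randint(0, 2, (64,))
for _ in range(100):
    z = encoder(x)
    # Task loss pulls z toward the label; the reversed adversarial loss
    # pushes z to carry no subgroup information.
    loss = ce(task_head(z), y) + ce(adv_head(GradReverse.apply(z)), g)
    opt.zero_grad()
    loss.backward()
    opt.step()
```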
arXiv Detail & Related papers (2022-03-24T16:19:19Z) - Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
Spurious correlations between input samples and the target labels can wrongly direct neural network predictions.
We propose an algorithm that optimizes for the worst-off group assignments from a constraint set.
We show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
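A minimal sketch of the worst-off-group principle, in the Group DRO style of exponentiated-gradient reweighting; the paper's optimisation over a constraint set of group assignments is omitted:
```python
import numpy as np

def worst_group_weights(group_losses, weights, step=0.1):
    """Exponentiated-gradient update: groups with higher loss receive more
    weight, steering training toward the worst-off group."""
    w = weights * np.exp(step * group_losses)
    return w / w.sum()

# Hypothetical per-group losses over a few rounds.
weights = np.ones(3) / 3
for losses in [np.array([0.2, 0.9, 0.4])] * 5:
    weights = worst_group_weights(losses, weights)
    print(weights)  # weight concentrates on group 1, the worst-off group
```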
arXiv Detail & Related papers (2022-01-10T22:04:48Z) - Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data [85.43008636875345]
We show that diverse representation in training data is key to increasing subgroup performance and achieving population-level objectives.
Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
arXiv Detail & Related papers (2021-03-05T00:27:08Z) - LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model.
We propose LOGAN, a new bias detection technique based on clustering.
Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
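A minimal sketch of clustering-based local bias detection: cluster examples in feature space, then compare per-group accuracy within each cluster (LOGAN additionally tunes the clustering to maximise detected bias, which is omitted here):
```python
import numpy as np
from sklearn.cluster import KMeans

def local_bias_by_cluster(embeddings, groups, correct, n_clusters=5):
    """Cluster examples, then report the per-cluster accuracy gap between
    two groups; a large gap in some cluster is a local bias region that a
    corpus-level average would hide."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    gaps = {}
    for c in range(n_clusters):
        m = labels == c
        acc = []
        for g in (0, 1):
            sel = m & (groups == g)
            acc.append(correct[sel].mean() if sel.any() else np.nan)
        gaps[c] = acc[0] - acc[1]
    return gaps

# Hypothetical data: 300 embeddings, binary group, per-example correctness.
rng = np.random.default_rng(0)
emb = rng.normal(size=(300, 8))
grp = rng.integers(0, 2, 300)
cor = rng.random(300) < 0.8
print(local_bias_by_cluster(emb, grp, cor))
```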
arXiv Detail & Related papers (2020-10-06T16:42:51Z)