From Biased Selective Labels to Pseudo-Labels: An Expectation-Maximization Framework for Learning from Biased Decisions
- URL: http://arxiv.org/abs/2406.18865v1
- Date: Thu, 27 Jun 2024 03:33:38 GMT
- Title: From Biased Selective Labels to Pseudo-Labels: An Expectation-Maximization Framework for Learning from Biased Decisions
- Authors: Trenton Chang, Jenna Wiens
- Abstract summary: We study a clinically-inspired selective label problem called disparate censorship.
We propose Disparate Censorship Expectation-Maximization (DCEM), an algorithm for learning in the presence of such censorship.
- Score: 9.440055827786596
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Selective labels occur when label observations are subject to a decision-making process; e.g., diagnoses that depend on the administration of laboratory tests. We study a clinically-inspired selective label problem called disparate censorship, where labeling biases vary across subgroups and unlabeled individuals are imputed as "negative" (i.e., no diagnostic test = no illness). Machine learning models naively trained on such labels could amplify labeling bias. Inspired by causal models of selective labels, we propose Disparate Censorship Expectation-Maximization (DCEM), an algorithm for learning in the presence of disparate censorship. We theoretically analyze how DCEM mitigates the effects of disparate censorship on model performance. We validate DCEM on synthetic data, showing that it improves bias mitigation (area between ROC curves) without sacrificing discriminative performance (AUC) compared to baselines. We achieve similar results in a sepsis classification task using clinical data.
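The abstract names the algorithm but not its updates, so the following is a minimal sketch of the generic EM pattern it describes: treat the labels of untested (censored) individuals as latent, impute soft pseudo-labels in the E-step, and refit the classifier on observed plus imputed labels in the M-step. All names here (`em_selective_labels`, the logistic-regression base model, the iteration count) are hypothetical; this is not the authors' exact DCEM procedure, which additionally models the testing decision to correct for subgroup-dependent censorship.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def em_selective_labels(X, y_obs, tested, n_iters=20):
    """EM-style learning under selective labels (illustrative sketch).

    X:      (n, d) feature matrix
    y_obs:  (n,) labels; reliable only where tested is True
            (untested individuals are censored, recorded as 0)
    tested: (n,) boolean mask of who received a diagnostic test
    """
    # Initialize on tested individuals, whose labels are observed.
    clf = LogisticRegression().fit(X[tested], y_obs[tested])
    for _ in range(n_iters):
        # E-step: soft pseudo-labels for the censored (untested) group.
        q = clf.predict_proba(X)[:, 1]
        q[tested] = y_obs[tested]  # observed labels stay fixed
        # M-step: refit with soft labels by duplicating each row as
        # (x, 1) with weight q and (x, 0) with weight 1 - q.
        X_aug = np.vstack([X, X])
        y_aug = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
        w_aug = np.concatenate([q, 1.0 - q])
        clf = LogisticRegression().fit(X_aug, y_aug, sample_weight=w_aug)
    return clf
```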
Related papers
- Label Set Optimization via Activation Distribution Kurtosis for Zero-shot Classification with Generative Models [10.699636123243138]
In-context learning (ICL) performance is sensitive to the prompt design, yet the impact of class label options in zero-shot classification has been largely overlooked.
This paper presents the first comprehensive empirical study of how label options influence zero-shot ICL classification performance.
arXiv Detail & Related papers (2024-10-24T22:59:23Z)
- You can't handle the (dirty) truth: Data-centric insights improve pseudo-labeling [60.27812493442062]
We show the importance of investigating labeled data quality to improve any pseudo-labeling method.
Specifically, we introduce a novel data characterization and selection framework called DIPS to extend pseudo-labeling.
We demonstrate the applicability and impact of DIPS for various pseudo-labeling methods across an extensive range of real-world datasets.
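DIPS's exact selection rule is not given in this summary; as one plausible instantiation, the sketch below characterizes labeled samples by their training dynamics (mean confidence in the given label and its variability across epochs) and keeps only stably, confidently learned samples to seed pseudo-labeling. The thresholds and the function name are assumptions.

```python
import numpy as np

def select_useful_labeled(conf_history, conf_thresh=0.7, var_thresh=0.05):
    """Characterize labeled samples by training dynamics (sketch).

    conf_history: (n_epochs, n_samples) model confidence in each
                  sample's *given* label, recorded over training.
    Returns a boolean mask of samples to keep for pseudo-labeling.
    """
    mean_conf = conf_history.mean(axis=0)   # how learnable the sample is
    variability = conf_history.std(axis=0)  # how ambiguous/noisy it is
    # Keep confidently and stably learned samples; route the rest to
    # inspection instead of letting them seed noisy pseudo-labels.
    return (mean_conf >= conf_thresh) & (variability <= var_thresh)
```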
arXiv Detail & Related papers (2024-06-19T17:58:40Z)
- Label-Retrieval-Augmented Diffusion Models for Learning from Noisy Labels [61.97359362447732]
Learning from noisy labels is an important and long-standing problem in machine learning for real applications.
In this paper, we reformulate the label-noise problem from a generative-model perspective.
Our model achieves new state-of-the-art (SOTA) results on all the standard real-world benchmark datasets.
arXiv Detail & Related papers (2023-05-31T03:01:36Z)
- Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection [149.23913018423022]
Weakly supervised video anomaly detection aims to identify abnormal events in videos using only video-level labels.
Two-stage self-training methods have achieved significant improvements by self-generating pseudo labels.
We propose an enhancement framework by exploiting completeness and uncertainty properties for effective self-training.
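As a generic illustration of the uncertainty side of such self-training (not this paper's exact criterion), the sketch below keeps only pseudo-labels whose Monte Carlo dropout uncertainty is low; the model interface and thresholds are assumptions.

```python
import torch

@torch.no_grad()
def confident_pseudo_labels(model, x, n_mc=10, max_std=0.1):
    """Filter pseudo-labels by MC-dropout uncertainty (sketch).

    model: a binary scorer containing dropout layers; x: (n, d) inputs.
    Returns (indices, labels) of samples kept for self-training.
    """
    model.train()  # keep dropout active to sample predictions
    probs = torch.stack([torch.sigmoid(model(x)).squeeze(-1)
                         for _ in range(n_mc)])      # (n_mc, n)
    mean, std = probs.mean(0), probs.std(0)
    keep = std < max_std                             # low-uncertainty only
    return keep.nonzero(as_tuple=True)[0], (mean[keep] > 0.5).long()
```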
arXiv Detail & Related papers (2022-12-08T05:53:53Z)
- Dist-PU: Positive-Unlabeled Learning from a Label Distribution Perspective [89.5370481649529]
We propose a label distribution perspective for PU learning in this paper.
Motivated by this view, we pursue consistency between the predicted and ground-truth label distributions.
Experiments on three benchmark datasets validate the effectiveness of the proposed method.
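To make "label distribution consistency" concrete, here is a minimal sketch under the standard PU assumption that the positive-class prior pi is known: penalize the gap between the predicted fraction of positives and pi. The function name and squared-error form are illustrative, not the paper's exact loss.

```python
import torch

def label_dist_consistency_loss(logits, prior=0.3):
    """Match the predicted label distribution to a known prior (sketch).

    logits: (n,) unnormalized scores on (mostly unlabeled) data.
    prior:  assumed P(y = 1), the positive-class prior pi.
    """
    pred_pos_rate = torch.sigmoid(logits).mean()
    # Penalize deviation of the batch-level positive rate from pi;
    # in practice combined with a supervised loss on labeled positives.
    return (pred_pos_rate - prior).pow(2)
```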
arXiv Detail & Related papers (2022-12-06T07:38:29Z)
- Disparate Censorship & Undertesting: A Source of Label Bias in Clinical Machine Learning [14.133370438685969]
Disparate censorship among patients of equivalent risk leads to undertesting in certain groups and, in turn, to more biased labels for those groups.
Our findings call attention to disparate censorship as a source of label bias in clinical ML models.
arXiv Detail & Related papers (2022-08-01T20:15:31Z)
- Instance-Dependent Partial Label Learning [69.49681837908511]
Partial label learning is a typical weakly supervised learning problem.
Most existing approaches assume that the incorrect labels in each training example are randomly picked as the candidate labels.
In this paper, we consider instance-dependent partial label learning and assume that each example is associated with a latent label distribution, in which a real-valued score for each label reflects how well it describes the instance.
arXiv Detail & Related papers (2021-10-25T12:50:26Z)
- Active label cleaning: Improving dataset quality under resource constraints [13.716577886649018]
Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models.
This work advocates for a data-driven approach to prioritising samples for re-annotation.
We rank instances according to estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy.
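A minimal sketch of such a ranking, under assumptions: score each sample by the model's probability that its current label is correct, break ties by predictive entropy as a proxy for labelling difficulty, and send the lowest-scoring samples for re-annotation first. The paper's actual scoring may differ.

```python
import numpy as np

def rank_for_relabelling(probs, given_labels):
    """Prioritize samples for re-annotation (sketch).

    probs:        (n, k) predicted class probabilities.
    given_labels: (n,) current (possibly noisy) labels.
    Returns sample indices, most suspicious first.
    """
    n = len(given_labels)
    correctness = probs[np.arange(n), given_labels]         # P(label is right)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # difficulty
    # Low correctness first; among equals, easier samples first so
    # annotators fix cheap, high-yield errors early.
    return np.lexsort((entropy, correctness))
```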
arXiv Detail & Related papers (2021-09-01T19:03:57Z)
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical markers (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
- Learning Image Labels On-the-fly for Training Robust Classification Models [13.669654965671604]
We show how noisy annotations (e.g., from different algorithm-based labelers) can be utilized together to mutually benefit the learning of classification tasks.
A meta-training-based label-sampling module is designed to attend to the labels that benefit model learning the most through additional back-propagation processes.
arXiv Detail & Related papers (2020-09-22T05:38:44Z)