Intersectional Fairness in Vision-Language Models for Medical Image Disease Classification
- URL: http://arxiv.org/abs/2512.15249v2
- Date: Wed, 24 Dec 2025 12:33:48 GMT
- Title: Intersectional Fairness in Vision-Language Models for Medical Image Disease Classification
- Authors: Yupeng Zhang, Adam G. Dunn, Usman Naseem, Jinman Kim
- Abstract summary: Cross-Modal Alignment Consistency (CMAC-MMD) is a training framework that standardises diagnostic certainty across intersectional patient subgroups. In the dermatology cohort, the proposed method reduced the overall intersectional missed diagnosis gap (difference in True Positive Rate, $\Delta$TPR) from 0.50 to 0.26. For glaucoma screening, the method reduced $\Delta$TPR from 0.41 to 0.31, achieving a better AUC of 0.72 (vs. 0.71 baseline).
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Medical artificial intelligence (AI) systems, particularly multimodal vision-language models (VLMs), often exhibit intersectional biases where models are systematically less confident in diagnosing marginalised patient subgroups. Such bias can lead to higher rates of inaccurate and missed diagnoses due to demographically skewed data and divergent distributions of diagnostic certainty. Current fairness interventions frequently fail to address these gaps or compromise overall diagnostic performance to achieve statistical parity among the subgroups. In this study, we developed Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that standardises diagnostic certainty across intersectional patient subgroups. Unlike traditional debiasing methods, this approach equalises the model's decision confidence without requiring sensitive demographic data during clinical inference. We evaluated this approach using 10,015 skin lesion images (HAM10000) with external validation on 12,000 images (BCN20000), and 10,000 fundus images for glaucoma detection (Harvard-FairVLMed), stratifying performance by intersectional age, gender, and race attributes. In the dermatology cohort, the proposed method reduced the overall intersectional missed diagnosis gap (difference in True Positive Rate, $\Delta$TPR) from 0.50 to 0.26 while improving the overall Area Under the Curve (AUC) from 0.94 to 0.97 compared to standard training. Similarly, for glaucoma screening, the method reduced $\Delta$TPR from 0.41 to 0.31, achieving a better AUC of 0.72 (vs. 0.71 baseline). This establishes a scalable framework for developing high-stakes clinical decision support systems that are both accurate and equitable across diverse patient subgroups, ensuring reliable performance without increasing privacy risks.
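The abstract names two quantities without giving code: the intersectional missed-diagnosis gap ($\Delta$TPR) and an alignment penalty whose "MMD" presumably stands for Maximum Mean Discrepancy. Below is a minimal sketch of plausible implementations, assuming binary labels, per-sample intersectional group IDs, and an RBF kernel with an arbitrary bandwidth; the function names and the exact form of the penalty are our assumptions, not the authors' released method.

```python
import torch

def delta_tpr(y_true, y_pred, groups):
    """Intersectional missed-diagnosis gap: max minus min true positive
    rate across intersectional subgroups. `groups` holds one
    intersectional label (e.g. age x gender x race) per sample;
    the encoding is an illustrative assumption."""
    tprs = []
    for g in torch.unique(groups):
        mask = (groups == g) & (y_true == 1)
        if mask.sum() > 0:  # skip subgroups with no positive cases
            tprs.append((y_pred[mask] == 1).float().mean())
    tprs = torch.stack(tprs)
    return (tprs.max() - tprs.min()).item()

def mmd_rbf(x, y, bandwidth=1.0):
    """Squared Maximum Mean Discrepancy with an RBF kernel between two
    1-D batches of model confidences (bandwidth is an assumed default)."""
    def k(a, b):
        return torch.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def confidence_alignment_penalty(confidences, groups):
    """Average pairwise MMD between subgroup confidence distributions;
    one plausible reading of 'standardising diagnostic certainty'."""
    gs = torch.unique(groups)
    pen, n = confidences.new_zeros(()), 0
    for i in range(len(gs)):
        for j in range(i + 1, len(gs)):
            pen = pen + mmd_rbf(confidences[groups == gs[i]],
                                confidences[groups == gs[j]])
            n += 1
    return pen / max(n, 1)

# total_loss = task_loss + lambda_fair * confidence_alignment_penalty(conf, groups)
```

Note that a penalty of this shape needs group IDs only during training; at inference the classifier runs on images alone, which is consistent with the abstract's claim of not requiring sensitive demographic data in clinical use.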
Related papers
- Patient-Conditioned Adaptive Offsets for Reliable Diagnosis across Subgroups
We introduce HyperAdapt, a patient-conditioned adaptation framework that improves subgroup reliability while maintaining a shared diagnostic model. Experiments across multiple public medical imaging benchmarks demonstrate that the proposed approach consistently improves subgroup-level performance without sacrificing overall accuracy.
arXiv Detail & Related papers (2026-01-19T14:33:41Z)
- Case Prompting to Mitigate Large Language Model Bias for ICU Mortality Prediction
Large language models (LLMs) show promise in predicting outcomes from structured medical data. However, LLMs may exhibit demographic biases related to sex, age, and race, limiting their trustworthy use in clinical practice. We propose a training-free, clinically adaptive prompting framework to simultaneously improve fairness and performance.
arXiv Detail & Related papers (2025-12-17T12:29:53Z)
- Fairness in Multi-modal Medical Diagnosis with Demonstration Selection
We propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations. FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.
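The abstract describes the selection idea only at a high level. A minimal sketch of one way such demonstration selection could work, combining embedding similarity with a per-group quota; the function name, the cosine scoring, and the quota are our assumptions, not the paper's API:

```python
import numpy as np

def select_demonstrations(query_emb, cand_embs, cand_groups, k_per_group=2):
    """Pick the most semantically similar candidates per demographic
    group so the in-context demonstration set is balanced.
    cand_embs: [N, d] candidate embeddings; cand_groups: [N] labels."""
    sims = cand_embs @ query_emb / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    chosen = []
    for g in np.unique(cand_groups):
        idx = np.where(cand_groups == g)[0]
        top = idx[np.argsort(-sims[idx])[:k_per_group]]
        chosen.extend(top.tolist())
    return chosen  # indices into the candidate pool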
arXiv Detail & Related papers (2025-11-20T02:38:00Z)
- MeCaMIL: Causality-Aware Multiple Instance Learning for Fair and Interpretable Whole Slide Image Diagnosis
Multiple instance learning (MIL) has emerged as the dominant paradigm for whole slide image (WSI) analysis in computational pathology. MeCaMIL, a causality-aware MIL framework, explicitly models demographic confounders through structured causal graphs. MeCaMIL achieves superior fairness: demographic disparity variance drops by over 65% on average across attributes.
arXiv Detail & Related papers (2025-11-14T06:47:21Z)
- FAST-CAD: A Fairness-Aware Framework for Non-Contact Stroke Diagnosis
We propose FAST-CAD, a theoretically grounded framework that combines domain-adversarial training with group distributionally robust optimization. Our approach is built on domain adaptation and minimax fairness theory and provides convergence guarantees and fairness bounds. Experiments show that our method achieves superior diagnostic performance while maintaining fairness across demographic groups.
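Group distributionally robust optimization is an established technique (due to Sagawa et al.), and the sketch below shows its core update rather than FAST-CAD's actual implementation: exponentially upweight the groups with the highest loss so training optimizes toward the worst-case group. The step size `eta` is an assumed hyperparameter.

```python
import torch

def group_dro_loss(per_sample_loss, groups, group_weights, eta=0.01):
    """One group-DRO step: compute the mean loss per group, update a
    persistent weight vector multiplicatively toward high-loss groups,
    and return the weighted average loss for backprop."""
    num_groups = group_weights.numel()
    group_losses = torch.stack([
        per_sample_loss[groups == g].mean() if (groups == g).any()
        else per_sample_loss.new_zeros(())
        for g in range(num_groups)
    ])
    with torch.no_grad():  # weights are state, not part of the graph
        group_weights *= torch.exp(eta * group_losses)
        group_weights /= group_weights.sum()
    return (group_weights * group_losses).sum()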
arXiv Detail & Related papers (2025-11-12T01:40:58Z)
- A Novel Attention-Augmented Wavelet YOLO System for Real-time Brain Vessel Segmentation on Transcranial Color-coded Doppler
We propose an AI-powered, real-time CoW auto-segmentation system capable of efficiently capturing cerebral arteries. No prior studies have explored AI-driven cerebrovascular segmentation using Transcranial Color-coded Doppler (TCCD). The proposed AAW-YOLO demonstrated strong performance in segmenting both ipsilateral and contralateral CoW vessels.
arXiv Detail & Related papers (2025-08-19T14:41:22Z)
- Debias-CLR: A Contrastive Learning Based Debiasing Method for Algorithmic Fairness in Healthcare Applications
We proposed an implicit in-processing debiasing method to combat disparate treatment.
We used clinical notes, diagnostic codes, procedure reports, and physiological vitals of heart failure patients.
We found that Debias-CLR was able to reduce the Single-Category Word Embedding Association Test (SC-WEAT) effect size score when debiasing for gender and ethnicity.
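SC-WEAT is a published metric (the single-category variant of Caliskan et al.'s WEAT). A minimal sketch of the effect size it reports for one target embedding against two attribute sets; the variable names are ours:

```python
import numpy as np

def sc_weat_effect_size(target, attrs_a, attrs_b):
    """Single-Category WEAT effect size: difference in mean cosine
    similarity of `target` to attribute sets A and B, scaled by the
    similarity std over A union B (a Cohen's-d-style score)."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    sims_a = np.array([cos(target, a) for a in attrs_a])
    sims_b = np.array([cos(target, b) for b in attrs_b])
    pooled = np.concatenate([sims_a, sims_b])
    return (sims_a.mean() - sims_b.mean()) / (pooled.std(ddof=1) + 1e-8)
```

Scores near zero indicate little association between the target and either attribute set; debiasing aims to shrink the magnitude, which is the reduction the paper reports.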
arXiv Detail & Related papers (2024-11-15T19:32:01Z)
- Fairness Evolution in Continual Learning for Medical Imaging
This study examines how bias evolves across tasks using domain-specific fairness metrics and how different CL strategies impact this evolution. Our results show that Learning without Forgetting and Pseudo-Label achieve optimal classification performance, but Pseudo-Label is less biased.
arXiv Detail & Related papers (2024-04-10T09:48:52Z)
- Rescuing referral failures during automated diagnosis of domain-shifted medical images
We show that even state-of-the-art domain generalization approaches fail severely during referral when tested on medical images acquired from a different demographic or using a different technology.
We evaluate novel combinations of robust generalization and post hoc referral approaches that rescue these failures and achieve significant performance improvements.
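"Referral" here is selective prediction: the model abstains on its least confident cases and defers them to a clinician. A minimal sketch of confidence-based referral at a fixed coverage; the rate and names are illustrative assumptions:

```python
import numpy as np

def refer_least_confident(probs, referral_rate=0.2):
    """Split predictions into kept and referred sets by confidence.
    `probs` is [N, num_classes]; the lowest-confidence fraction is
    deferred to a human reader."""
    confidence = probs.max(axis=1)
    threshold = np.quantile(confidence, referral_rate)
    keep = confidence > threshold
    return keep, ~keep  # boolean masks: auto-diagnosed vs. referred
```

The failure mode the paper targets is that under domain shift the confidence ranking itself degrades, so the referred cases are no longer the ones the model actually gets wrong.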
arXiv Detail & Related papers (2023-11-28T13:14:55Z)
- Learning to diagnose cirrhosis from radiological and histological labels with joint self and weakly-supervised pretraining strategies
We propose to leverage transfer learning from large datasets annotated by radiologists, to predict the histological score available on a small annex dataset.
We compare different pretraining methods, namely weakly-supervised and self-supervised ones, to improve the prediction of cirrhosis.
This method outperforms the baseline classification of the METAVIR score, reaching an AUC of 0.84 and a balanced accuracy of 0.75.
arXiv Detail & Related papers (2023-02-16T17:06:23Z)
- Private, fair and accurate: Training large-scale, privacy-preserving AI models in medical imaging
We evaluated the effect of privacy-preserving training of AI models on accuracy and fairness compared to non-private training.
Our study shows that -- under the challenging realistic circumstances of a real-life clinical dataset -- the privacy-preserving training of diagnostic deep learning models is possible with excellent diagnostic accuracy and fairness.
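Privacy-preserving training in this setting typically means differential privacy via DP-SGD: per-sample gradient clipping plus calibrated Gaussian noise. The sketch below shows that core step in plain PyTorch rather than the authors' pipeline, with the clip norm and noise multiplier as assumed hyperparameters; production code would use a vectorized library and track the privacy budget.

```python
import torch

def dp_sgd_step(model, loss_fn, xs, ys, lr=0.1, clip=1.0, noise_mult=1.1):
    """One simplified DP-SGD step: clip each sample's gradient to
    `clip`, sum, add Gaussian noise scaled by noise_mult * clip,
    then apply the averaged noisy gradient."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(xs, ys):  # per-sample gradients (slow but explicit)
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))
        scale = torch.clamp(clip / (norm + 1e-8), max=1.0)
        for s, p in zip(summed, params):
            s += p.grad * scale
    n = len(xs)
    with torch.no_grad():
        for s, p in zip(summed, params):
            noise = torch.randn_like(s) * noise_mult * clip
            p -= lr * (s + noise) / n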
arXiv Detail & Related papers (2023-02-03T09:49:13Z)
- Estimating and Improving Fairness with Adversarial Learning
We propose an adversarial multi-task training strategy to simultaneously mitigate and detect bias in the deep learning-based medical image analysis system.
Specifically, we propose to add a discrimination module against bias and a critical module that predicts unfairness within the base classification model.
We evaluate our framework on a large-scale publicly available skin lesion dataset.
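Adversarial debiasing of this kind is commonly implemented with a gradient reversal layer: an auxiliary head tries to predict the sensitive attribute from shared features, while reversed gradients push the encoder to remove that signal. A minimal sketch under that assumption; the module names and the `lam` weight are ours, not the paper's code:

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies incoming gradients
    by -lam on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()
    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

def adversarial_losses(encoder, cls_head, adv_head, x, y, attr, lam=1.0):
    """Task loss plus an adversarial loss on the sensitive attribute."""
    feats = encoder(x)
    task_loss = nn.functional.cross_entropy(cls_head(feats), y)
    adv_logits = adv_head(GradReverse.apply(feats, lam))
    adv_loss = nn.functional.cross_entropy(adv_logits, attr)
    return task_loss + adv_loss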
arXiv Detail & Related papers (2021-03-07T03:10:32Z)
- UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 F1 score for AD detection and up to 0.609 PR-AUC for NASH detection, outperforming the best state-of-the-art baseline by up to 19%.
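The abstract does not say how UNITE estimates uncertainty; Monte Carlo dropout is one standard stand-in, shown here purely to illustrate producing a risk score together with an uncertainty estimate:

```python
import torch

def mc_dropout_predict(model, x, n_samples=20):
    """Predictive mean and std via Monte Carlo dropout: keep dropout
    active at inference and average stochastic forward passes.
    This is a generic technique, not UNITE's actual estimator."""
    model.train()  # keeps dropout layers stochastic
    with torch.no_grad():
        preds = torch.stack([torch.sigmoid(model(x)) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)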
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.