Evaluating the Fairness of Deep Learning Uncertainty Estimates in
Medical Image Analysis
- URL: http://arxiv.org/abs/2303.03242v1
- Date: Mon, 6 Mar 2023 16:01:30 GMT
- Title: Evaluating the Fairness of Deep Learning Uncertainty Estimates in
Medical Image Analysis
- Authors: Raghav Mehta, Changjian Shui, Tal Arbel
- Abstract summary: Deep learning (DL) models have shown great success in many medical image analysis tasks.
However, deployment of the resulting models into real clinical contexts requires robustness and fairness across different sub-populations.
Recent studies have shown significant biases in DL models across demographic subgroups, indicating a lack of fairness in the models.
- Score: 3.5536769591744557
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although deep learning (DL) models have shown great success in many medical
image analysis tasks, deployment of the resulting models into real clinical
contexts requires: (1) that they exhibit robustness and fairness across
different sub-populations, and (2) that the confidence in DL model predictions
be accurately expressed in the form of uncertainties. Unfortunately, recent
studies have indeed shown significant biases in DL models across demographic
subgroups (e.g., race, sex, age) in the context of medical image analysis,
indicating a lack of fairness in the models. Although several methods have been
proposed in the ML literature to mitigate a lack of fairness in DL models, they
focus entirely on the absolute performance between groups without considering
their effect on uncertainty estimation. In this work, we present the first
exploration of how popular fairness methods affect bias mitigation across
subgroups in medical image analysis, both in terms of bottom-line performance
and in their effects on uncertainty quantification. We perform extensive
experiments on three different clinically relevant tasks: (i) skin lesion
classification, (ii) brain tumour segmentation, and (iii) Alzheimer's disease
clinical score regression. Our results indicate that popular ML methods, such
as data-balancing and distributionally robust optimization, succeed in
mitigating performance disparities across subgroups for some of the tasks.
However, this can come at the cost of poor uncertainty estimates associated
with the model predictions. This tradeoff must be addressed if fairness
methods are to be adopted in medical image analysis.
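The core evaluation question, how well calibrated the model's uncertainties are within each demographic subgroup, can be made concrete with a small sketch. The following is illustrative only (not code from the paper; the function names and toy data are hypothetical): it computes expected calibration error (ECE) separately per subgroup and reports the largest gap, one simple way to check whether fairness interventions such as data-balancing or distributionally robust optimization degrade uncertainty quality for some groups.

import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    # Weighted average gap between mean confidence and accuracy per confidence bin.
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

def subgroup_ece_gap(conf, correct, groups):
    # ECE per demographic subgroup, plus the worst-case gap between subgroups.
    eces = {g: expected_calibration_error(conf[groups == g], correct[groups == g])
            for g in np.unique(groups)}
    return eces, max(eces.values()) - min(eces.values())

# Toy usage with random data standing in for a skin-lesion classifier's outputs:
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)          # predicted-class probabilities
correct = (rng.uniform(size=1000) < conf) * 1.0  # 1.0 where the prediction was right
groups = rng.choice(["female", "male"], size=1000)
per_group, gap = subgroup_ece_gap(conf, correct, groups)
print(per_group, gap)

A large per-subgroup ECE gap would indicate that uncertainty estimates are less trustworthy for some groups, even when accuracy gaps between groups appear closed.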
Related papers
- Fairness Analysis of CLIP-Based Foundation Models for X-Ray Image Classification [15.98427699337596]
We perform a comprehensive fairness analysis of CLIP-like models applied to X-ray image classification.
We assess their performance and fairness across diverse patient demographics and disease categories using zero-shot inference and various fine-tuning techniques.
arXiv Detail & Related papers (2025-01-31T12:23:50Z)
- Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis [55.959002385347645]
Scaling by training on large datasets has been shown to enhance the quality and fidelity of image generation and manipulation with diffusion models.
Latent Drifting enables diffusion models to be conditioned for medical images fitted for the complex task of counterfactual image generation.
Our results demonstrate significant performance gains in various scenarios when combined with different fine-tuning schemes.
arXiv Detail & Related papers (2024-12-30T01:59:34Z)
- PRECISe : Prototype-Reservation for Explainable Classification under Imbalanced and Scarce-Data Settings [0.0]
PRECISe is an explainable-by-design model constructed to address all three challenges: data imbalance, data scarcity, and the need for interpretability.
PRECISe outperforms current state-of-the-art methods on data-efficient generalization to minority classes.
A case study is presented to highlight the model's ability to produce easily interpretable predictions.
arXiv Detail & Related papers (2024-08-11T12:05:32Z)
- Fairness Evolution in Continual Learning for Medical Imaging [47.52603262576663]
We study the behavior of Continual Learning (CL) strategies in medical imaging regarding classification performance.
We evaluate the Replay, Learning without Forgetting (LwF), and Pseudo-Label strategies.
LwF and Pseudo-Label achieve the best classification performance, but when fairness metrics are included in the evaluation, Pseudo-Label is clearly less biased.
arXiv Detail & Related papers (2024-04-10T09:48:52Z)
- Inspecting Model Fairness in Ultrasound Segmentation Tasks [20.281029492841878]
We inspect a series of deep learning (DL) segmentation models using two ultrasound datasets.
Our findings reveal that even state-of-the-art DL algorithms demonstrate unfair behavior in ultrasound segmentation tasks.
These results serve as a crucial warning, underscoring the necessity for careful model evaluation before their deployment in real-world scenarios.
arXiv Detail & Related papers (2023-12-05T05:08:08Z)
- On the Out of Distribution Robustness of Foundation Models in Medical Image Segmentation [47.95611203419802]
Foundation models for vision and language, pre-trained on extensive sets of natural image and text data, have emerged as a promising approach.
We compare the generalization performance to unseen domains of various pre-trained models after being fine-tuned on the same in-distribution dataset.
We further develop a new Bayesian uncertainty estimation method for frozen models and use it as an indicator to characterize the model's performance on out-of-distribution data.
arXiv Detail & Related papers (2023-11-18T14:52:10Z)
- Benchmarking Heterogeneous Treatment Effect Models through the Lens of Interpretability [82.29775890542967]
Estimating personalized effects of treatments is a complex, yet pervasive problem.
Recent developments in the machine learning literature on heterogeneous treatment effect estimation gave rise to many sophisticated, but opaque, tools.
We use post-hoc feature importance methods to identify features that influence the model's predictions.
arXiv Detail & Related papers (2022-06-16T17:59:05Z)
- Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls [1.3870303451896246]
We implement random forest and deep convolutional neural network models using several medical imaging datasets.
We show that violation of the independence assumption could substantially affect model generalizability.
Inappropriate performance indicators could lead to erroneous conclusions.
arXiv Detail & Related papers (2022-02-01T05:07:27Z)
- What Do You See in this Patient? Behavioral Testing of Clinical NLP Models [69.09570726777817]
We introduce an extendable testing framework that evaluates the behavior of clinical outcome models regarding changes of the input.
We show that model behavior varies drastically even when fine-tuned on the same data and that allegedly best-performing models have not always learned the most medically plausible patterns.
arXiv Detail & Related papers (2021-11-30T15:52:04Z)
- On the Robustness of Pretraining and Self-Supervision for a Deep Learning-based Analysis of Diabetic Retinopathy [70.71457102672545]
We compare the impact of different training procedures for diabetic retinopathy grading.
We investigate different aspects such as quantitative performance, statistics of the learned feature representations, interpretability and robustness to image distortions.
Our results indicate that models initialized with ImageNet pretraining show a significant increase in performance, generalization, and robustness to image distortions.
arXiv Detail & Related papers (2021-06-25T08:32:45Z)