BiasICL: In-Context Learning and Demographic Biases of Vision Language Models
- URL: http://arxiv.org/abs/2503.02334v1
- Date: Tue, 04 Mar 2025 06:45:54 GMT
- Title: BiasICL: In-Context Learning and Demographic Biases of Vision Language Models
- Authors: Sonnet Xu, Joseph Janizek, Yixing Jiang, Roxana Daneshjou
- Abstract summary: Vision language models (VLMs) show promise in medical diagnosis, but their performance across demographic subgroups when using in-context learning (ICL) remains poorly understood. We examine how the demographic composition of demonstration examples affects VLM performance in two medical imaging tasks: skin lesion malignancy prediction and pneumothorax detection from chest radiographs.
- Score: 0.7499722271664147
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Vision language models (VLMs) show promise in medical diagnosis, but their performance across demographic subgroups when using in-context learning (ICL) remains poorly understood. We examine how the demographic composition of demonstration examples affects VLM performance in two medical imaging tasks: skin lesion malignancy prediction and pneumothorax detection from chest radiographs. Our analysis reveals that ICL influences model predictions through multiple mechanisms: (1) ICL allows VLMs to learn subgroup-specific disease base rates from prompts, and (2) ICL leads VLMs to make predictions that perform differently across demographic groups, even after controlling for subgroup-specific disease base rates. Our empirical results inform best practices for prompting current VLMs (specifically, examining demographic subgroup performance, and matching the base rates of labels to the target distribution both in aggregate and within subgroups), while also suggesting next steps for improving our theoretical understanding of these models.
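The prompting recommendation above is straightforward to operationalize. Below is a minimal sketch, assuming a labeled demonstration pool with per-example demographic annotations, of how one might assemble an ICL demonstration set whose positive-label base rate matches a target prevalence both in aggregate and within each subgroup; the data layout and function name are illustrative assumptions, not code from the paper.

```python
import random
from collections import defaultdict

def sample_demonstrations(pool, n_shots, target_rate, seed=0):
    """Sample ICL demonstration examples so the positive-label base rate
    matches `target_rate` overall and within each demographic subgroup.

    `pool` is a list of dicts with keys "image", "label" (0/1), "group";
    this schema is an assumption for illustration.
    """
    rng = random.Random(seed)
    by_group = defaultdict(lambda: {0: [], 1: []})
    for ex in pool:
        by_group[ex["group"]][ex["label"]].append(ex)

    groups = list(by_group)
    per_group = n_shots // len(groups)  # equal representation per subgroup
    demos = []
    for g in groups:
        n_pos = round(per_group * target_rate)  # match base rate within the group
        n_neg = per_group - n_pos
        demos += rng.sample(by_group[g][1], n_pos)
        demos += rng.sample(by_group[g][0], n_neg)
    rng.shuffle(demos)  # avoid ordering effects in the prompt
    return demos
```

For example, `sample_demonstrations(pool, n_shots=16, target_rate=0.3)` would build a 16-shot prompt at a 30% positive base rate, split evenly across subgroups, since matching the rate within every subgroup also matches it in aggregate.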
Related papers
- Investigating LLMs in Clinical Triage: Promising Capabilities, Persistent Intersectional Biases [6.135648377533492]
Large Language Models (LLMs) have shown promise in clinical decision support, yet their application to triage remains underexplored.
We systematically investigate the capabilities of LLMs in emergency department triage through two key dimensions.
We assess multiple LLM-based approaches, ranging from continued pre-training to in-context learning, as well as machine learning approaches.
arXiv Detail & Related papers (2025-04-22T21:11:47Z)
- Estimating Commonsense Plausibility through Semantic Shifts [66.06254418551737]
We propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts.
Evaluations on two types of fine-grained commonsense plausibility estimation tasks show that ComPaSS consistently outperforms baselines.
arXiv Detail & Related papers (2025-02-19T06:31:06Z)
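As a rough illustration of the semantic-shift idea in the ComPaSS entry above: score a statement by how little appending it shifts the context's embedding. The embedding model, the cosine scoring, and the "smaller shift means more plausible" reading are our assumptions, not the authors' released method.

```python
# Sketch of plausibility-as-semantic-shift under the assumptions stated above.
# Requires the sentence-transformers package.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def plausibility(context: str, statement: str) -> float:
    """Cosine similarity between the context embedding and the embedding of
    the context with the statement appended; higher similarity (smaller
    semantic shift) is read here as higher plausibility."""
    ctx, shifted = model.encode([context, context + " " + statement])
    return float(np.dot(ctx, shifted) / (np.linalg.norm(ctx) * np.linalg.norm(shifted)))

print(plausibility("He dropped the glass.", "It shattered on the floor."))
print(plausibility("He dropped the glass.", "It orbited the moon."))
```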
- DiversityMedQA: Assessing Demographic Biases in Medical Diagnosis using Large Language Models [2.750784330885499]
We introduce DiversityMedQA, a novel benchmark designed to assess large language model (LLM) responses to medical queries across diverse patient demographics.
Our findings reveal notable discrepancies in model performance when tested against these demographic variations.
arXiv Detail & Related papers (2024-09-02T23:37:20Z)
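The DiversityMedQA entry above suggests a simple audit pattern: perturb the demographics mentioned in a clinical vignette and check whether the model's answer changes. The sketch below is hypothetical; the `{demo}` template, the demographic list, and the `ask_llm` callable are placeholders, not the benchmark's actual interface.

```python
# Hypothetical DiversityMedQA-style audit: swap the demographic slot in a
# templated question and record per-demographic correctness.
DEMOGRAPHICS = ["male", "female", "Black", "White", "Asian", "Hispanic"]

def demographic_variants(question: str, placeholder: str = "{demo}") -> dict:
    """Instantiate one templated question for each demographic value."""
    return {d: question.replace(placeholder, d) for d in DEMOGRAPHICS}

def audit(question: str, correct: str, ask_llm) -> dict:
    """Return, per demographic, whether the model answered correctly.
    `ask_llm` is any callable mapping a question string to an answer string."""
    return {
        demo: ask_llm(variant) == correct
        for demo, variant in demographic_variants(question).items()
    }
```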
- Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLMs) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z)
- Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias [3.455189439319919]
We introduce Cross-Care, the first benchmark framework dedicated to assessing biases and real-world knowledge in large language models (LLMs).
We evaluate how demographic biases embedded in pre-training corpora like ThePile influence the outputs of LLMs.
Our results highlight substantial misalignment between LLM representations of disease prevalence and real disease prevalence rates across demographic subgroups.
arXiv Detail & Related papers (2024-05-09T02:33:14Z)
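The Cross-Care comparison above reduces to contrasting two prevalence tables: one implied by the model or its pre-training corpus, one from epidemiology. A toy sketch, with made-up numbers purely for illustration:

```python
# Per-subgroup gap between model-implied and real disease prevalence.
def misalignment(model_rates: dict, real_rates: dict) -> dict:
    """Absolute prevalence gap for each demographic subgroup."""
    return {g: abs(model_rates[g] - real_rates[g]) for g in real_rates}

model_rates = {"group_a": 0.30, "group_b": 0.05}  # e.g. co-mention-derived rates
real_rates = {"group_a": 0.12, "group_b": 0.10}   # epidemiological rates
print(misalignment(model_rates, real_rates))      # {'group_a': 0.18, 'group_b': 0.05}
```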
- SkinGEN: an Explainable Dermatology Diagnosis-to-Generation Framework with Interactive Vision-Language Models [54.32264601568605]
SkinGEN is a diagnosis-to-generation framework that generates reference demonstrations from diagnosis results provided by a VLM.
We conduct a user study with 32 participants evaluating both system performance and explainability.
Results demonstrate that SkinGEN significantly improves users' comprehension of VLM predictions and fosters increased trust in the diagnostic process.
arXiv Detail & Related papers (2024-04-23T05:36:33Z)
- FairCLIP: Harnessing Fairness in Vision-Language Learning [20.743027598445796]
We introduce the first fair vision-language medical dataset that provides detailed demographic attributes, ground-truth labels, and clinical notes.
As the first dataset of its kind, Harvard-FairVLMed holds the potential to catalyze the development of machine learning models that are both fairness-aware and clinically effective.
arXiv Detail & Related papers (2024-03-29T03:15:31Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
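The text-to-image retrieval benchmark described in the entry above is typically scored with Recall@K over a shared embedding space. A self-contained sketch, with random vectors standing in for a trained VLM's report and X-ray embeddings:

```python
import numpy as np

def recall_at_k(text_emb: np.ndarray, img_emb: np.ndarray, k: int = 5) -> float:
    """Fraction of reports whose paired image (same row index) ranks in the
    top-k images by cosine similarity."""
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    sims = text_emb @ img_emb.T                      # (n_texts, n_images)
    topk = np.argsort(-sims, axis=1)[:, :k]          # top-k image indices per text
    hits = (topk == np.arange(len(text_emb))[:, None]).any(axis=1)
    return float(hits.mean())

rng = np.random.default_rng(0)  # random stand-ins for learned embeddings
print(recall_at_k(rng.normal(size=(100, 64)), rng.normal(size=(100, 64))))
```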
- Auditing Algorithmic Fairness in Machine Learning for Health with Severity-Based LOGAN [70.76142503046782]
We propose supplementing bias audits of machine learning (ML) healthcare tools with SLOGAN, an automatic tool for capturing local biases in a clinical prediction task.
SLOGAN adapts an existing tool, LOGAN (LOcal Group biAs detectioN), by contextualizing group bias detection in patient illness severity and past medical history.
On average, SLOGAN identifies larger fairness disparities than LOGAN in over 75% of patient groups while maintaining clustering quality.
arXiv Detail & Related papers (2022-11-16T08:04:12Z)
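A rough sketch of the severity-contextualized audit the SLOGAN entry above describes: stratify patients by illness severity, then measure the spread in per-group error rates inside each stratum. The column names are illustrative assumptions, not SLOGAN's actual interface.

```python
import pandas as pd

def severity_stratified_gaps(df: pd.DataFrame) -> pd.Series:
    """Max minus min group error rate within each severity stratum.
    Expects columns "prediction", "label", "severity", "group" (assumed)."""
    df = df.assign(error=(df["prediction"] != df["label"]).astype(float))
    rates = df.groupby(["severity", "group"])["error"].mean()
    return rates.groupby(level="severity").agg(lambda r: r.max() - r.min())
```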
- Assessing Social Determinants-Related Performance Bias of Machine Learning Models: A Case of Hyperchloremia Prediction in ICU Population [6.8473641147443995]
We evaluated four classifiers built to predict hyperchloremia, a condition that often results from aggressive fluid administration in the ICU population.
We observed that adding social determinants features in addition to the lab-based ones improved model performance on all patients.
We urge future researchers to design models that proactively adjust for potential biases and include subgroup reporting.
arXiv Detail & Related papers (2021-11-18T03:58:50Z)
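The subgroup reporting urged in the entry above can be as simple as scoring the model separately per subgroup rather than only overall. A minimal sketch with assumed column names:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_auc(df: pd.DataFrame) -> pd.Series:
    """AUROC of the predicted risk within each social-determinants subgroup.
    Expects columns "subgroup", "label", "risk_score" (assumed)."""
    return df.groupby("subgroup").apply(
        lambda g: roc_auc_score(g["label"], g["risk_score"])
    )
```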
- Adversarial Sample Enhanced Domain Adaptation: A Case Study on Predictive Modeling with Electronic Health Records [57.75125067744978]
We propose a data augmentation method to facilitate domain adaptation.
Adversarially generated samples are used during the domain adaptation process.
Results confirm the effectiveness of our method and its generality across different tasks.
arXiv Detail & Related papers (2021-01-13T03:20:20Z)
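The adversarial augmentation idea in the final entry can be illustrated with a generic FGSM-style perturbation of source-domain feature vectors; this is a sketch of the general technique, not the paper's exact procedure.

```python
import torch

def fgsm_augment(model, x, y, loss_fn, eps: float = 0.01) -> torch.Tensor:
    """Return adversarially perturbed copies of inputs `x` (fast gradient
    sign method), to be mixed into training during domain adaptation."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)   # loss at the current parameters
    loss.backward()                   # gradient w.r.t. the inputs
    return (x_adv + eps * x_adv.grad.sign()).detach()
```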