Building Trust in Clinical LLMs: Bias Analysis and Dataset Transparency
- URL: http://arxiv.org/abs/2510.18556v1
- Date: Tue, 21 Oct 2025 12:08:39 GMT
- Title: Building Trust in Clinical LLMs: Bias Analysis and Dataset Transparency
- Authors: Svetlana Maslenkova, Clement Christophe, Marco AF Pimentel, Tathagata Raha, Muhammad Umar Salman, Ahmed Al Mahrooqi, Avani Gupta, Shadab Khan, Ronnie Rajan, Praveenkumar Kanithi,
- Abstract summary: We present an in-depth analysis of potential downstream biases in clinical language models.<n>We focus on differential opioid prescription tendencies across diverse demographic groups, such as ethnicity, gender, and age.<n>We introduce HC4: Healthcare Comprehensive Commons Corpus, a novel and extensively curated pretraining dataset exceeding 89 billion tokens.
- Score: 3.097915343305639
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models offer transformative potential for healthcare, yet their responsible and equitable development depends critically on a deeper understanding of how training data characteristics influence model behavior, including the potential for bias. Current practices in dataset curation and bias assessment often lack the necessary transparency, creating an urgent need for comprehensive evaluation frameworks to foster trust and guide improvements. In this study, we present an in-depth analysis of potential downstream biases in clinical language models, with a focus on differential opioid prescription tendencies across diverse demographic groups, such as ethnicity, gender, and age. As part of this investigation, we introduce HC4: Healthcare Comprehensive Commons Corpus, a novel and extensively curated pretraining dataset exceeding 89 billion tokens. Our evaluation leverages both established general benchmarks and a novel, healthcare-specific methodology, offering crucial insights to support fairness and safety in clinical AI applications.
Related papers
- Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation [97.36081721024728]
We propose the first benchmark for assessing confidence in multi-turn interaction during realistic medical consultations.<n>Our benchmark unifies three types of medical data for open-ended diagnostic generation.<n>We present MedConf, an evidence-grounded linguistic self-assessment framework.
arXiv Detail & Related papers (2026-01-22T04:51:39Z) - Beyond Traditional Diagnostics: Transforming Patient-Side Information into Predictive Insights with Knowledge Graphs and Prototypes [55.310195121276074]
We propose a Knowledge graph-enhanced, Prototype-aware, and Interpretable (KPI) framework to predict diseases.<n>It integrates structured and trusted medical knowledge into a unified disease knowledge graph, constructs clinically meaningful disease prototypes, and employs contrastive learning to enhance predictive accuracy.<n>It provides clinically valid explanations that closely align with patient narratives, highlighting its practical value for patient-centered healthcare delivery.
arXiv Detail & Related papers (2025-12-09T05:37:54Z) - Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning.<n>This paper provides the first systematic review of this emerging field.<n>We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z) - Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning.<n>We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z) - Predictive Representativity: Uncovering Racial Bias in AI-based Skin Cancer Detection [0.0]
This paper introduces the concept of Predictive Representativity (PR)<n>PR shifts the focus from the composition of the data set to outcomes-level equity.<n>Our analysis reveals substantial performance disparities by skin phototype.
arXiv Detail & Related papers (2025-07-10T22:21:06Z) - Bias in Large Language Models Across Clinical Applications: A Systematic Review [0.0]
Large language models (LLMs) are rapidly being integrated into healthcare, promising to enhance various clinical tasks.<n>This systematic review investigates the prevalence, sources, manifestations, and clinical implications of bias in LLMs.
arXiv Detail & Related papers (2025-04-03T13:32:08Z) - Systematic Literature Review on Clinical Trial Eligibility Matching [0.24554686192257422]
Review highlights how explainable AI and standardized ontology can bolster clinician trust and broaden adoption.<n>Further research into advanced semantic and temporal representations, expanded data integration, and rigorous prospective evaluations is necessary to fully realize the transformative potential of NLP in clinical trial recruitment.
arXiv Detail & Related papers (2025-03-02T11:45:50Z) - SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy [45.2233252981348]
Large Language Models (LLMs) have been shown to encode clinical knowledge.<n>We present SemioLLM, an evaluation framework that benchmarks 6 state-of-the-art models.<n>We show that most LLMs are able to accurately and confidently generate probabilistic predictions of seizure onset zones in the brain.
arXiv Detail & Related papers (2024-07-03T11:02:12Z) - A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models [20.11590976578911]
Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities.
Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity.
We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions.
arXiv Detail & Related papers (2024-03-18T17:56:37Z) - Evaluating the Fairness of the MIMIC-IV Dataset and a Baseline
Algorithm: Application to the ICU Length of Stay Prediction [65.268245109828]
This paper uses the MIMIC-IV dataset to examine the fairness and bias in an XGBoost binary classification model predicting the ICU length of stay.
The research reveals class imbalances in the dataset across demographic attributes and employs data preprocessing and feature extraction.
The paper concludes with recommendations for fairness-aware machine learning techniques for mitigating biases and the need for collaborative efforts among healthcare professionals and data scientists.
arXiv Detail & Related papers (2023-12-31T16:01:48Z) - From Classification to Clinical Insights: Towards Analyzing and Reasoning About Mobile and Behavioral Health Data With Large Language Models [21.427976533706737]
We take a novel approach that leverages large language models to synthesize clinically useful insights from multi-sensor data.
We develop chain of thought prompting methods that use LLMs to generate reasoning about how trends in data relate to conditions like depression and anxiety.
We find models like GPT-4 correctly reference numerical data 75% of the time, and clinician participants express strong interest in using this approach to interpret self-tracking data.
arXiv Detail & Related papers (2023-11-21T23:53:27Z) - Large Language Models for Healthcare Data Augmentation: An Example on
Patient-Trial Matching [49.78442796596806]
We propose an innovative privacy-aware data augmentation approach for patient-trial matching (LLM-PTM)
Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%.
arXiv Detail & Related papers (2023-03-24T03:14:00Z) - Estimating and Improving Fairness with Adversarial Learning [65.99330614802388]
We propose an adversarial multi-task training strategy to simultaneously mitigate and detect bias in the deep learning-based medical image analysis system.
Specifically, we propose to add a discrimination module against bias and a critical module that predicts unfairness within the base classification model.
We evaluate our framework on a large-scale public-available skin lesion dataset.
arXiv Detail & Related papers (2021-03-07T03:10:32Z) - Clinical Outcome Prediction from Admission Notes using Self-Supervised
Knowledge Integration [55.88616573143478]
Outcome prediction from clinical text can prevent doctors from overlooking possible risks.
Diagnoses at discharge, procedures performed, in-hospital mortality and length-of-stay prediction are four common outcome prediction targets.
We propose clinical outcome pre-training to integrate knowledge about patient outcomes from multiple public sources.
arXiv Detail & Related papers (2021-02-08T10:26:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.