Reducing Large Language Model Safety Risks in Women's Health using Semantic Entropy
- URL: http://arxiv.org/abs/2503.00269v1
- Date: Sat, 01 Mar 2025 00:57:52 GMT
- Title: Reducing Large Language Model Safety Risks in Women's Health using Semantic Entropy
- Authors: Jahan C. Penny-Dimri, Magdalena Bachmann, William R. Cooke, Sam Mathewlynn, Samuel Dockree, John Tolladay, Jannik Kossen, Lin Li, Yarin Gal, Gabriel Davis Jones
- Abstract summary: Large language models (LLMs) generate false or misleading outputs, known as hallucinations. Traditional methods for quantifying uncertainty, such as perplexity, fail to capture meaning-level inconsistencies that lead to misinformation. We evaluate semantic entropy (SE), a novel uncertainty metric, to detect hallucinations in AI-generated medical content.
- Score: 29.14930590607661
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) hold substantial promise for clinical decision support. However, their widespread adoption in medicine, particularly in healthcare, is hindered by their propensity to generate false or misleading outputs, known as hallucinations. In high-stakes domains such as women's health (obstetrics & gynaecology), where errors in clinical reasoning can have profound consequences for maternal and neonatal outcomes, ensuring the reliability of AI-generated responses is critical. Traditional methods for quantifying uncertainty, such as perplexity, fail to capture meaning-level inconsistencies that lead to misinformation. Here, we evaluate semantic entropy (SE), a novel uncertainty metric that assesses meaning-level variation, to detect hallucinations in AI-generated medical content. Using a clinically validated dataset derived from UK RCOG MRCOG examinations, we compared SE with perplexity in identifying uncertain responses. SE demonstrated superior performance, achieving an AUROC of 0.76 (95% CI: 0.75-0.78), compared to 0.62 (0.60-0.65) for perplexity. Clinical expert validation further confirmed its effectiveness, with SE achieving near-perfect uncertainty discrimination (AUROC: 0.97). While semantic clustering was successful in only 30% of cases, SE remains a valuable tool for improving AI safety in women's health. These findings suggest that SE could enable more reliable AI integration into clinical practice, particularly in resource-limited settings where LLMs could augment care. This study highlights the potential of SE as a key safeguard in the responsible deployment of AI-driven tools in women's health, leading to safer and more effective digital health interventions.
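The comparison above rests on two uncertainty scores: perplexity, computed from the token log-probabilities of a single answer, and semantic entropy, computed from how many distinct meanings appear among several sampled answers. Below is a minimal Python sketch of the discrete (count-based) variant of semantic entropy alongside a simple perplexity baseline; the `entails` callable (assumed to be backed by an off-the-shelf NLI model) and the greedy clustering loop are illustrative assumptions, not the authors' implementation.

```python
import math

def semantic_entropy(answers, entails):
    """Discrete semantic entropy over sampled answers.

    answers : list of answer strings sampled from the LLM at temperature > 0
    entails : callable (a, b) -> bool, True if text a entails text b
              (assumed to be backed by an NLI model)

    Answers are grouped into meaning clusters via bidirectional entailment;
    the Shannon entropy of the cluster proportions is returned. High values
    mean the samples disagree in meaning, flagging a likely hallucination.
    """
    clusters = []  # each cluster is a list of indices into `answers`
    for i, ans in enumerate(answers):
        for cluster in clusters:
            rep = answers[cluster[0]]  # cluster representative
            if entails(ans, rep) and entails(rep, ans):
                cluster.append(i)
                break
        else:  # no cluster matched: start a new one
            clusters.append([i])

    n = len(answers)
    proportions = [len(c) / n for c in clusters]
    return -sum(p * math.log(p) for p in proportions)


def perplexity(token_log_probs):
    """Length-normalized perplexity of one answer, the baseline SE is compared against."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

In practice, answers whose semantic entropy exceeds a tuned threshold would be flagged for abstention or human review; a probability-weighted variant replaces the cluster counts with summed sequence likelihoods.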
Related papers
- The challenge of uncertainty quantification of large language models in medicine [0.0]
This study investigates uncertainty quantification in large language models (LLMs) for medical applications.
Our research frames uncertainty not as a barrier but as an essential part of knowledge that invites a dynamic and reflective approach to AI design.
arXiv Detail & Related papers (2025-04-07T17:24:11Z)
- Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.
We propose an evaluation framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.
Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
- Evaluating Spoken Language as a Biomarker for Automated Screening of Cognitive Impairment [37.40606157690235]
Alterations in speech and language can be early predictors of Alzheimer's disease and related dementias. We evaluated machine learning techniques for ADRD screening and severity prediction from spoken language. Risk stratification and linguistic feature importance analysis enhanced the interpretability and clinical utility of predictions.
arXiv Detail & Related papers (2025-01-30T20:17:17Z)
- Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare [15.438265972219869]
Large language models (LLMs) are increasingly utilized in healthcare applications. This study systematically assesses the vulnerabilities of seven LLMs to three advanced black-box jailbreaking techniques.
arXiv Detail & Related papers (2025-01-27T22:07:52Z)
- Detecting Bias and Enhancing Diagnostic Accuracy in Large Language Models for Healthcare [0.2302001830524133]
Biased AI-generated medical advice and misdiagnoses can jeopardize patient safety.
This study introduces new resources designed to promote ethical and precise AI in healthcare.
arXiv Detail & Related papers (2024-10-09T06:00:05Z)
- Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations [63.330182403615886]
A major barrier to the practical deployment of large language models (LLMs) is their lack of reliability.
This is particularly apparent in three situations: correctness, hallucinations when given unanswerable questions, and safety.
In all three cases, models should ideally abstain from responding, much like humans, whose awareness of uncertainty makes them refrain from answering questions they don't know.
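A minimal sketch of this abstention pattern, assuming a `generate` callable that wraps the LLM and an `uncertainty` scorer (semantic entropy, verbalized confidence, or similar); the names and the threshold are placeholders rather than the paper's method.

```python
def answer_or_abstain(question, generate, uncertainty, threshold=0.5):
    """Return the model's answer only when its uncertainty is acceptable.

    generate    : callable question -> answer string (wraps an LLM call)
    uncertainty : callable (question, answer) -> uncertainty score
    threshold   : abstain when the score reaches or exceeds this value
    """
    answer = generate(question)
    if uncertainty(question, answer) >= threshold:
        return "I am not confident enough to answer this question."
    return answer
```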
arXiv Detail & Related papers (2024-04-16T23:56:38Z)
- Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond [52.246494389096654]
This paper introduces Word-Sequence Entropy (WSE), a method that calibrates uncertainty at both the word and sequence levels.
We compare WSE with six baseline methods on five free-form medical QA datasets, utilizing seven popular large language models (LLMs).
arXiv Detail & Related papers (2024-02-22T03:46:08Z)
- RAISE -- Radiology AI Safety, an End-to-end lifecycle approach [5.829180249228172]
The integration of AI into radiology introduces opportunities for improved clinical care provision and efficiency.
The focus should be on ensuring models meet the highest standards of safety, effectiveness and efficacy.
The roadmap presented herein aims to expedite the achievement of deployable, reliable, and safe AI in radiology.
arXiv Detail & Related papers (2023-11-24T15:59:14Z)
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
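A hedged sketch of this self-verification loop: the same model is asked to ground each extracted item in the source note and return the supporting span. The `llm` callable and the prompt wording are illustrative assumptions, not the paper's prompts.

```python
def self_verify_extractions(note, extractions, llm):
    """Keep only extracted items the model can ground in the source note.

    note        : the clinical text the items were extracted from
    extractions : list of candidate items (e.g. medications, diagnoses)
    llm         : callable prompt -> completion string
    """
    verified = []
    for item in extractions:
        prompt = (
            "Clinical note:\n"
            f"{note}\n\n"
            f'Claimed extraction: "{item}"\n'
            "Quote the exact sentence from the note that supports this "
            "extraction, or reply NONE if it is not supported."
        )
        evidence = llm(prompt).strip()
        if evidence and evidence.upper() != "NONE":
            verified.append({"item": item, "evidence": evidence})
    return verified
```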
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
- Towards Reliable Medical Image Segmentation by utilizing Evidential Calibrated Uncertainty [52.03490691733464]
We introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks.
By leveraging subjective logic theory, we explicitly model probability and uncertainty for the problem of medical image segmentation.
DEviS incorporates an uncertainty-aware filtering module, which utilizes the metric of uncertainty-calibrated error to filter reliable data.
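For intuition, here is a generic evidential (subjective-logic style) uncertainty sketch in PyTorch: per-pixel evidence parameterizes a Dirichlet distribution, and the vacuity K/S serves as the uncertainty map. This illustrates the general technique only; it is not the DEviS architecture or its uncertainty-calibrated filtering module, and the cutoff value is an assumption.

```python
import torch
import torch.nn.functional as F

def evidential_uncertainty(logits):
    """Class probabilities and per-pixel uncertainty from segmentation logits.

    logits : tensor of shape (batch, num_classes, H, W)
    Evidence e = softplus(logits) gives Dirichlet parameters alpha = e + 1;
    expected probabilities are alpha / S and the vacuity K / S is used as
    the uncertainty map (higher = less collected evidence).
    """
    evidence = F.softplus(logits)               # non-negative evidence
    alpha = evidence + 1.0                      # Dirichlet parameters
    strength = alpha.sum(dim=1, keepdim=True)   # S = sum_k alpha_k
    probs = alpha / strength                    # expected class probabilities
    uncertainty = logits.shape[1] / strength    # vacuity K / S in (0, 1]
    return probs, uncertainty

def reliable_mask(uncertainty, max_uncertainty=0.3):
    """Boolean mask of pixels trusted under an illustrative uncertainty cutoff."""
    return uncertainty.squeeze(1) <= max_uncertainty
```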
arXiv Detail & Related papers (2023-01-01T05:02:46Z)
- Can uncertainty boost the reliability of AI-based diagnostic methods in digital pathology? [3.8424737607413157]
We evaluate whether adding uncertainty estimates for DL predictions in digital pathology could result in increased value for clinical applications.
We compare the effectiveness of model-integrated methods (MC dropout and Deep ensembles) with a model-agnostic approach.
Our results show that uncertainty estimates can add some reliability and reduce sensitivity to classification threshold selection.
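As a concrete example of the model-integrated route, the sketch below keeps dropout active at inference, averages several stochastic forward passes, and uses the predictive entropy as the uncertainty estimate; the pass count and the dropout-only reactivation are conventional choices, not the study's exact setup.

```python
import torch

def mc_dropout_predict(model, x, passes=20):
    """Mean prediction and predictive entropy via Monte Carlo dropout.

    model  : a torch.nn.Module containing dropout layers
    x      : input batch
    passes : number of stochastic forward passes
    """
    model.eval()
    # Re-enable only the dropout layers, keeping e.g. batch norm in eval mode.
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()

    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(passes)]
        )  # shape: (passes, batch, num_classes)

    mean_probs = probs.mean(dim=0)
    entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)
    return mean_probs, entropy
```

Deep ensembles follow the same aggregation but average over independently trained models instead of dropout masks.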
arXiv Detail & Related papers (2021-12-17T10:10:00Z)
- UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, outperforming the best state-of-the-art baseline by up to 19%.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.