A Comprehensive Survey on the Trustworthiness of Large Language Models in Healthcare
- URL: http://arxiv.org/abs/2502.15871v2
- Date: Wed, 17 Sep 2025 01:04:55 GMT
- Title: A Comprehensive Survey on the Trustworthiness of Large Language Models in Healthcare
- Authors: Manar Aljohani, Jun Hou, Sindhura Kommu, Xuan Wang
- Abstract summary: The application of large language models (LLMs) in healthcare holds significant promise for enhancing clinical decision-making, medical research, and patient care. Their integration into real-world clinical settings raises critical concerns around trustworthiness, particularly around dimensions of truthfulness, privacy, safety, robustness, fairness, and explainability.
- Score: 8.378348088931578
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The application of large language models (LLMs) in healthcare holds significant promise for enhancing clinical decision-making, medical research, and patient care. However, their integration into real-world clinical settings raises critical concerns around trustworthiness, particularly around dimensions of truthfulness, privacy, safety, robustness, fairness, and explainability. These dimensions are essential for ensuring that LLMs generate reliable, unbiased, and ethically sound outputs. While researchers have recently begun developing benchmarks and evaluation frameworks to assess LLM trustworthiness, the trustworthiness of LLMs in healthcare remains underexplored, lacking a systematic review that provides a comprehensive understanding and future insights. This survey addresses that gap by providing a comprehensive review of current methodologies and solutions aimed at mitigating risks across key trust dimensions. We analyze how each dimension affects the reliability and ethical deployment of healthcare LLMs, synthesize ongoing research efforts, and identify critical gaps in existing approaches. We also identify emerging challenges posed by evolving paradigms, such as multi-agent collaboration, multi-modal reasoning, and the development of small open-source medical models. Our goal is to guide future research toward more trustworthy, transparent, and clinically viable LLMs.
Related papers
- Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation [97.36081721024728]
We propose the first benchmark for assessing confidence in multi-turn interactions during realistic medical consultations. Our benchmark unifies three types of medical data for open-ended diagnostic generation. We present MedConf, an evidence-grounded linguistic self-assessment framework.
arXiv Detail & Related papers (2026-01-22T04:51:39Z) - Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation [51.19622266249408]
MultiTrust-X is a benchmark for evaluating, analyzing, and mitigating the trustworthiness issues of MLLMs. Based on the taxonomy, MultiTrust-X includes 32 tasks and 28 curated datasets. Our experiments reveal significant vulnerabilities in current models.
arXiv Detail & Related papers (2025-08-21T09:00:01Z) - Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare. A red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses in four safety-critical domains. A suite of adversarial agents is applied to autonomously mutate test cases, identify and evolve unsafe-triggering strategies, and evaluate responses. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
arXiv Detail & Related papers (2025-07-30T08:44:22Z) - Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings [51.73411055162861]
We introduce a safety evaluation protocol tailored to the medical domain, covering both patient and clinician user perspectives. This is the first work to define safety evaluation criteria for medical LLMs through targeted red-teaming from three different points of view.
arXiv Detail & Related papers (2025-07-09T19:38:58Z) - The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs [57.1838332916627]
Large Language Models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP). Their widespread deployment has also raised significant safety concerns. LLM-generated content can exhibit unsafe behaviors such as toxicity, bias, or misinformation, especially in adversarial contexts.
arXiv Detail & Related papers (2025-06-06T05:50:50Z) - Trustworthy Medical Question Answering: An Evaluation-Centric Survey [36.06747842975472]
We systematically examine six key dimensions of trustworthiness in medical question-answering systems. We analyze evaluation-guided techniques that drive model improvements. We propose future research directions to advance the safe, reliable, and transparent deployment of LLM-powered medical QA.
arXiv Detail & Related papers (2025-06-04T07:48:10Z) - Large Language Models for Cancer Communication: Evaluating Linguistic Quality, Safety, and Accessibility in Generative AI [0.40744588528519854]
Effective communication about breast and cervical cancers remains a persistent health challenge. This study evaluates the capabilities and limitations of Large Language Models (LLMs) in generating accurate, safe, and accessible cancer-related information.
arXiv Detail & Related papers (2025-05-15T16:23:21Z) - Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns.
Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance.
We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z) - The challenge of uncertainty quantification of large language models in medicine [0.0]
This study investigates uncertainty quantification in large language models (LLMs) for medical applications. Our research frames uncertainty not as a barrier but as an essential part of knowledge that invites a dynamic and reflective approach to AI design.
arXiv Detail & Related papers (2025-04-07T17:24:11Z) - Can LLMs Support Medical Knowledge Imputation? An Evaluation-Based Perspective [1.4913052010438639]
We have explored the use of Large Language Models (LLMs) for imputing missing treatment relationships.
LLMs offer promising capabilities in knowledge augmentation, but their application in medical knowledge imputation presents significant risks.
Our findings highlight critical limitations, including inconsistencies with established clinical guidelines and potential risks to patient safety.
arXiv Detail & Related papers (2025-03-29T02:52:17Z) - REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models [59.445672459851274]
REVAL is a comprehensive benchmark designed to evaluate the REliability and VALues of Large Vision-Language Models.
REVAL encompasses over 144K image-text Visual Question Answering (VQA) samples, structured into two primary sections: Reliability and Values.
We evaluate 26 models, including mainstream open-source LVLMs and prominent closed-source models like GPT-4o and Gemini-1.5-Pro.
arXiv Detail & Related papers (2025-03-20T07:54:35Z) - A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations [127.52707312573791]
This survey provides a comprehensive analysis of LVLM safety, covering key aspects such as attacks, defenses, and evaluation methods.
We introduce a unified framework that integrates these interrelated components, offering a holistic perspective on the vulnerabilities of LVLMs.
We conduct a set of safety evaluations on the latest LVLM, Deepseek Janus-Pro, and provide a theoretical analysis of the results.
arXiv Detail & Related papers (2025-02-14T08:42:43Z) - Large Language Models in Healthcare [4.119811542729794]
Large language models (LLMs) hold promise for transforming healthcare.
Their successful integration requires rigorous development, adaptation, and evaluation strategies tailored to clinical needs.
arXiv Detail & Related papers (2025-02-06T20:53:33Z) - A Mixed-Methods Evaluation of LLM-Based Chatbots for Menopause [7.156867036177255]
The integration of Large Language Models (LLMs) into healthcare settings has gained significant attention. We examine the performance of publicly available LLM-based chatbots for menopause-related queries. Our findings highlight the promise and limitations of traditional evaluation metrics for sensitive health topics.
arXiv Detail & Related papers (2025-02-05T19:56:52Z) - A Survey on the Honesty of Large Language Models [115.8458596738659]
Honesty is a fundamental principle for aligning large language models (LLMs) with human values.
Despite promising, current LLMs still exhibit significant dishonest behaviors.
arXiv Detail & Related papers (2024-09-27T14:34:54Z) - A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and Law [65.87885628115946]
Large language models (LLMs) are revolutionizing the landscapes of finance, healthcare, and law.
We highlight the instrumental role of LLMs in enhancing diagnostic and treatment methodologies in healthcare, innovating financial analytics, and refining legal interpretation and compliance strategies.
We critically examine the ethics for LLM applications in these fields, pointing out the existing ethical concerns and the need for transparent, fair, and robust AI systems.
arXiv Detail & Related papers (2024-05-02T22:43:02Z) - A Comprehensive Survey on Evaluating Large Language Model Applications in the Medical Industry [2.1717945745027425]
Large Language Models (LLMs) have evolved significantly, impacting various industries with their advanced capabilities in language understanding and generation.
This comprehensive survey delineates the extensive application and requisite evaluation of LLMs within healthcare.
Our survey is structured to provide an in-depth analysis of LLM applications across clinical settings, medical text data processing, research, education, and public health awareness.
arXiv Detail & Related papers (2024-04-24T09:55:24Z) - A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models [20.11590976578911]
Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities.
Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity.
We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions.
arXiv Detail & Related papers (2024-03-18T17:56:37Z) - MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models [32.35118292932457]
We first define the notion of medical safety in large language models (LLMs) based on the Principles of Medical Ethics set forth by the American Medical Association.
We then leverage this understanding to introduce MedSafetyBench, the first benchmark dataset designed to measure the medical safety of LLMs.
Our results show that publicly available medical LLMs do not meet standards of medical safety and that fine-tuning them using MedSafetyBench improves their medical safety while preserving their medical performance.
arXiv Detail & Related papers (2024-03-06T14:34:07Z) - Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science [65.77763092833348]
Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines.
While their capabilities are promising, these agents also introduce novel vulnerabilities that demand careful consideration for safety.
This paper conducts a thorough examination of vulnerabilities in LLM-based agents within scientific domains, shedding light on potential risks associated with their misuse and emphasizing the need for safety measures.
arXiv Detail & Related papers (2024-02-06T18:54:07Z) - TrustLLM: Trustworthiness in Large Language Models [446.5640421311468]
This paper introduces TrustLLM, a comprehensive study of trustworthiness in large language models (LLMs).
We first propose a set of principles for trustworthy LLMs that span eight different dimensions.
Based on these principles, we establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics.
arXiv Detail & Related papers (2024-01-10T22:07:21Z) - Medical Misinformation in AI-Assisted Self-Diagnosis: Development of a Method (EvalPrompt) for Analyzing Large Language Models [4.8775268199830935]
This study aims to assess the effectiveness of large language models (LLMs) as a self-diagnostic tool and their role in spreading healthcare misinformation.
We use open-ended questions to mimic real-world self-diagnosis use cases, and perform sentence dropout to mimic realistic self-diagnosis with missing information.
The results highlight the modest capabilities of LLMs, as their responses are often unclear and inaccurate.
arXiv Detail & Related papers (2023-07-10T21:28:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.