Are LLMs reliable? An exploration of the reliability of large language models in clinical note generation
- URL: http://arxiv.org/abs/2505.17095v1
- Date: Wed, 21 May 2025 03:44:13 GMT
- Title: Are LLMs reliable? An exploration of the reliability of large language models in clinical note generation
- Authors: Kristine Ann M. Carandang, Jasper Meynard P. Araña, Ethan Robert A. Casin, Christopher P. Monterola, Daniel Stanley Y. Tan, Jesus Felix B. Valenzuela, Christian M. Alis
- Abstract summary: This study evaluates the reliability of 12 open-weight and proprietary LLMs from Anthropic, Meta, Mistral, and OpenAI in clinical note generation (CNG). Overall, Meta's Llama 70B was the most reliable, followed by Mistral's Small model.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Healthcare providers (HCPs) bear legal and ethical responsibility for accurate documentation and for protecting patient data privacy, so the natural variability in the responses of large language models (LLMs) poses challenges for integrating LLM-driven clinical note generation (CNG) systems into real-world clinical workflows. The detailed nature of CNG texts further amplifies this complexity. To strengthen HCPs' confidence in LLM-powered tools, this study evaluates the reliability of 12 open-weight and proprietary LLMs from Anthropic, Meta, Mistral, and OpenAI for CNG: their ability, across several iterations of the same prompt, to generate notes that are string-equivalent (consistency rate), have the same meaning (semantic consistency), and are correct (semantic similarity). The results show that (1) LLMs from all model families are stable, producing semantically consistent responses despite differences in wording, and (2) most of the LLMs generated notes close to the corresponding notes written by experts. Overall, Meta's Llama 70B was the most reliable, followed by Mistral's Small model. Based on these findings, we recommend local deployment of these relatively small open-weight models for CNG, both to ensure compliance with data privacy regulations and to improve HCPs' efficiency in clinical documentation.
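As a rough illustration of the three reliability measures, the sketch below scores repeated generations for the same prompt: exact string equivalence for the consistency rate, and embedding cosine similarity as a stand-in for semantic consistency and semantic similarity. The abstract does not specify the paper's embedding model or scoring scheme, so the `sentence-transformers` model and the pairwise-averaging used here are assumptions.

```python
import itertools

import numpy as np
from sentence_transformers import SentenceTransformer


def consistency_rate(notes: list[str]) -> float:
    """Fraction of note pairs that are exactly string-equivalent."""
    pairs = list(itertools.combinations(notes, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def semantic_consistency(notes: list[str], model: SentenceTransformer) -> float:
    """Mean pairwise cosine similarity among repeated generations."""
    emb = model.encode(notes, normalize_embeddings=True)
    sims = emb @ emb.T                        # cosine similarity (unit vectors)
    iu = np.triu_indices(len(notes), k=1)     # upper triangle, no diagonal
    return float(sims[iu].mean())


def semantic_similarity(notes: list[str], reference: str,
                        model: SentenceTransformer) -> float:
    """Mean cosine similarity of each note to an expert-written reference."""
    emb = model.encode(notes + [reference], normalize_embeddings=True)
    return float((emb[:-1] @ emb[-1]).mean())


if __name__ == "__main__":
    model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model
    notes = [                                        # repeated generations
        "Patient reports mild headache for two days.",
        "Pt reports a mild headache over two days.",
        "Patient reports mild headache for two days.",
    ]
    print(consistency_rate(notes))
    print(semantic_consistency(notes, model))
    print(semantic_similarity(
        notes, "Patient presents with a mild two-day headache.", model))
```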
Related papers
- Development and Validation of the Provider Documentation Summarization Quality Instrument for Large Language Models [3.0569643495382173]
The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. Validation included Pearson correlation for substantive validity, and factor analysis and Cronbach's alpha for structural validity. The PDSQI-9 demonstrated strong internal consistency and high inter-rater reliability (a minimal Cronbach's alpha sketch follows this entry).
arXiv Detail & Related papers (2025-01-15T17:47:57Z)
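For context on one of the PDSQI-9 validity checks, Cronbach's alpha measures the internal consistency of a multi-item instrument. A minimal sketch, assuming only a (subjects x items) matrix of ratings; the instrument's actual data layout is not given here and the ratings below are hypothetical:

```python
import numpy as np


def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (subjects x items) matrix of ratings.

    alpha = k/(k-1) * (1 - sum(item variances) / variance of total score)
    """
    k = scores.shape[1]                      # number of items
    item_var = scores.var(axis=0, ddof=1)    # per-item variance
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var.sum() / total_var)


# Hypothetical ratings: 5 summaries scored on a 9-item instrument (1-5 scale)
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(5, 9)).astype(float)
print(cronbach_alpha(ratings))
```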
- Aligning Large Language Models for Faithful Integrity Against Opposing Argument [71.33552795870544]
Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. However, they can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. We propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation.
arXiv Detail & Related papers (2025-01-02T16:38:21Z)
- Benchmarking LLMs and SLMs for patient reported outcomes [0.0]
This study benchmarks several SLMs against LLMs for summarizing patient-reported Q&A forms in the context of radiotherapy. Using various metrics, we evaluate their precision and reliability. The findings highlight both the promise and limitations of SLMs for high-stakes medical tasks, fostering more efficient and privacy-preserving AI-driven healthcare solutions.
arXiv Detail & Related papers (2024-12-20T19:01:25Z)
- LLMs-in-the-Loop Part 2: Expert Small AI Models for Anonymization and De-identification of PHI Across Multiple Languages [0.0]
This paper introduces expert small AI models developed using the LLM-in-the-loop methodology to meet the demand for domain-specific de-identification NER models. Our de-identification NER models, developed in eight languages, achieved average F1-micro scores of 0.966, 0.975, 0.976, 0.970, 0.964, 0.974, 0.978, and 0.953, respectively (a micro-F1 sketch follows this entry).
arXiv Detail & Related papers (2024-12-14T18:10:29Z)
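As a reminder of the headline metric above, micro-averaged F1 pools true positives, false positives, and false negatives across all entity classes before computing precision and recall once. A minimal token-level sketch with hypothetical PHI labels (the paper's entity-level scoring, e.g. via seqeval, may differ):

```python
from sklearn.metrics import f1_score

# Hypothetical token-level PHI labels (gold vs. predicted)
y_true = ["NAME", "DATE", "O", "ID", "NAME", "O"]
y_pred = ["NAME", "DATE", "O", "O",  "NAME", "DATE"]

# Micro-F1 over entity classes only: exclude the "O" (non-PHI) label,
# aggregate TP/FP/FN across classes, then compute a single F1.
print(f1_score(y_true, y_pred, average="micro", labels=["NAME", "DATE", "ID"]))
```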
- Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
- Beyond Self-Consistency: Ensemble Reasoning Boosts Consistency and Accuracy of LLMs in Cancer Staging [0.33554367023486936]
Cancer staging status is available in clinical reports, but extracting it requires natural language processing.
With advances in clinically oriented large language models, it is promising to extract this status without extensive effort in training the algorithms.
In this study, we propose an ensemble reasoning approach with the aim of improving the consistency of the model generations (a minimal voting sketch follows this entry).
arXiv Detail & Related papers (2024-04-19T19:34:35Z)
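Self-consistency-style approaches sample several generations and keep the answer the samples agree on. A minimal majority-vote sketch, assuming a generic `generate` callable that returns one staging label per call; the paper's ensemble combines reasoning paths, which this simplification omits:

```python
import random
from collections import Counter
from typing import Callable


def majority_vote_stage(generate: Callable[[str], str],
                        prompt: str, n_samples: int = 5) -> str:
    """Sample several staging answers and return the most frequent one."""
    answers = [generate(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a sampled LLM call."""
    return random.choice(["Stage II", "Stage II", "Stage III"])


print(majority_vote_stage(fake_llm, "Report: ... What is the overall stage?"))
```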
- Natural Language Programming in Medicine: Administering Evidence Based Clinical Workflows with Autonomous Agents Powered by Generative Large Language Models [29.05425041393475]
Generative Large Language Models (LLMs) hold significant promise in healthcare.
This study assessed the potential of LLMs to function as autonomous agents in a simulated tertiary care medical center.
arXiv Detail & Related papers (2024-01-05T15:09:57Z)
- LM-Polygraph: Uncertainty Estimation for Language Models [71.21409522341482]
Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of large language models (LLMs).
We introduce LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python.
It introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores (a generic uncertainty sketch follows this entry).
arXiv Detail & Related papers (2023-11-13T15:08:59Z)
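One of the simplest UE signals such frameworks implement is length-normalized negative log-likelihood of a generation. The sketch below is a generic illustration from per-token log-probabilities, not LM-Polygraph's actual API, and the numbers are hypothetical:

```python
def mean_nll(token_logprobs: list[float]) -> float:
    """Length-normalized negative log-likelihood of one generation.

    Higher values suggest the model was less certain about its output.
    """
    return -sum(token_logprobs) / len(token_logprobs)


# Hypothetical per-token log-probs returned alongside two generations
confident = [-0.05, -0.10, -0.02, -0.07]
uncertain = [-1.20, -2.30, -0.90, -1.75]
print(mean_nll(confident))  # low score: likely reliable
print(mean_nll(uncertain))  # high score: flag for review
```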
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- Rethinking STS and NLI in Large Language Models [38.74393637449224]
We rethink semantic textual similarity (STS) and natural language inference (NLI).
We first evaluate the performance of STS and NLI in the clinical/biomedical domain.
We then assess LLMs' predictive confidence and their capability of capturing collective human opinions.
arXiv Detail & Related papers (2023-09-16T11:58:39Z)
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
However, they still struggle with accuracy and interpretability, especially in mission-critical domains such as healthcare.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and to check its own outputs (a minimal sketch follows this entry).
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
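A minimal self-verification loop in the spirit of the entry above, assuming a generic `llm` callable; the paper's actual prompts and provenance format are not given here, so the wording below is illustrative only:

```python
from typing import Callable


def extract_with_self_verification(llm: Callable[[str], str],
                                   note: str, field: str) -> str:
    """Two-pass extraction: generate, then ask the model to check itself."""
    draft = llm(f"Extract the {field} from this clinical note:\n{note}")
    verdict = llm(
        f"Note:\n{note}\n\nProposed {field}: {draft}\n"
        "Is this supported by the note? Answer YES or NO, "
        "and quote the supporting evidence."
    )
    if verdict.strip().upper().startswith("YES"):
        return draft
    # On failed verification, retry the extraction once
    return llm(f"Re-extract the {field}, carefully, from:\n{note}")
```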
- Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching [49.78442796596806]
We propose an innovative privacy-aware data augmentation approach for patient-trial matching (LLM-PTM).
Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%.
arXiv Detail & Related papers (2023-03-24T03:14:00Z)