Development and Validation of the Provider Documentation Summarization Quality Instrument for Large Language Models
- URL: http://arxiv.org/abs/2501.08977v2
- Date: Fri, 17 Jan 2025 19:10:00 GMT
- Title: Development and Validation of the Provider Documentation Summarization Quality Instrument for Large Language Models
- Authors: Emma Croxford, Yanjun Gao, Nicholas Pellegrino, Karen K. Wong, Graham Wills, Elliot First, Miranda Schnier, Kyle Burton, Cris G. Ebby, Jillian Gorskic, Matthew Kalscheur, Samy Khalil, Marie Pisani, Tyler Rubeor, Peter Stetson, Frank Liao, Cherodeep Goswami, Brian Patterson, Majid Afshar
- Abstract summary: The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. Validation included Pearson correlation for substantive validity, and factor analysis and Cronbach's alpha for structural validity. The PDSQI-9 demonstrated strong internal consistency and high inter-rater reliability.
- Score: 3.0569643495382173
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential to evaluate their performance before implementation. Existing instruments for provider documentation quality are often unsuitable for the complexities of LLM-generated text and lack validation on real-world data. The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). Validation included Pearson correlation for substantive validity, factor analysis and Cronbach's alpha for structural validity, inter-rater reliability (ICC and Krippendorff's alpha) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity. Seven physician raters evaluated 779 summaries and answered 8,329 questions, achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated strong internal consistency (Cronbach's alpha = 0.879; 95% CI: 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI: 0.867-0.868), supporting structural validity and generalizability. Factor analysis identified a 4-factor model explaining 58% of the variance, representing organization, clarity, accuracy, and utility. Substantive validity was supported by correlations between note length and scores for Succinct ($\rho = -0.200$, $p = 0.029$) and Organized ($\rho = -0.190$, $p = 0.037$). Discriminant validity distinguished high- from low-quality summaries ($p < 0.001$). The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and facilitate safer integration of LLMs into healthcare workflows.
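The validation pipeline rests on standard psychometric statistics. As a minimal illustrative sketch (using synthetic ratings, not the study's data), Cronbach's alpha can be computed directly from an observations-by-items score matrix; inter-rater measures such as the ICC and Krippendorff's alpha are usually obtained from dedicated packages (e.g., `pingouin.intraclass_corr` or the `krippendorff` package).

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for an (observations x items) score matrix."""
    n_items = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1)         # per-item variance
    total_var = ratings.sum(axis=1).var(ddof=1)     # variance of summed scores
    return (n_items / (n_items - 1)) * (1.0 - item_vars.sum() / total_var)

# Synthetic example: 20 summaries scored on 9 Likert items (1-5),
# driven by a shared latent "quality" factor so the items correlate.
rng = np.random.default_rng(0)
latent = rng.normal(3.0, 0.8, size=(20, 1))
scores = np.clip(np.rint(latent + rng.normal(0.0, 0.5, size=(20, 9))), 1, 5)
print(f"Cronbach's alpha = {cronbach_alpha(scores):.3f}")
```

Because the synthetic items share a common latent factor, the estimate lands well above the conventional 0.7 threshold; fully independent random items would push it toward zero.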
Related papers
- Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies [11.0505830548286]
This study introduces a granular error taxonomy through systematic analysis of top 10 models on MedBench.
Evaluation of 10 leading models reveals vulnerabilities, even though they achieve 0.86 accuracy in medical knowledge recall.
Our analysis uncovers systemic weaknesses in knowledge boundary enforcement and multi-step reasoning.
arXiv Detail & Related papers (2025-03-10T13:28:25Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.
We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.
Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models [79.41859481668618]
Large Language Models (LLMs) have significantly advanced fact-checking research.
Existing automated fact-checking evaluation methods rely on static datasets and classification metrics.
We introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs' fact-checking capabilities.
arXiv Detail & Related papers (2025-02-25T07:44:22Z) - VeriFact: Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records [2.8078482678056527]
VeriFact is an artificial intelligence system for fact-checking large language models (LLMs) in clinical medicine.
It decomposes Brief Hospital Course narratives into simple statements with clinician annotations for whether each statement is supported by the patient's EHR clinical notes.
It achieves up to 92.7% agreement when compared to a denoised and adjudicated average human clinician ground truth.
arXiv Detail & Related papers (2025-01-28T03:13:16Z) - Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs) [0.5434005537854512]
This study explored the potential of large language models (LLMs) to automate OSCE evaluations using the Master Interview Rating Scale (MIRS).
We compared the performance of four state-of-the-art LLMs in evaluating OSCE transcripts across all 28 items of the MIRS under the conditions of zero-shot, chain-of-thought (CoT), few-shot, and multi-step prompting.
arXiv Detail & Related papers (2025-01-21T04:05:45Z) - A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed.
We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness.
Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z) - AIPatient: Simulating Patients with EHRs and LLM Powered Agentic Workflow [33.8495939261319]
We develop an advanced simulated patient system with AIPatient Knowledge Graph (AIPatient KG) as the input and Reasoning Retrieval-Augmented Generation (Reasoning RAG) as the generation backbone.
Reasoning RAG leverages six LLM powered agents spanning tasks including retrieval, KG query generation, abstraction, checker, rewrite, and summarization.
Our system also presents high readability (median Flesch Reading Ease 77.23; median Flesch-Kincaid Grade 5.6), robustness (ANOVA F-value 0.6126, p > 0.1), and stability (ANOVA F-value 0.782, p > 0.1); a brief sketch of these readability formulas follows the list below.
arXiv Detail & Related papers (2024-09-27T17:17:15Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - Synchronous Faithfulness Monitoring for Trustworthy Retrieval-Augmented Generation [96.78845113346809]
Retrieval-augmented language models (RALMs) have shown strong performance and wide applicability in knowledge-intensive tasks.
This paper proposes SynCheck, a lightweight monitor that leverages fine-grained decoding dynamics to detect unfaithful sentences.
We also introduce FOD, a faithfulness-oriented decoding algorithm guided by beam search for long-form retrieval-augmented generation.
arXiv Detail & Related papers (2024-06-19T16:42:57Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z) - EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries [9.031182965159976]
Large Language Models (LLMs) show promise in efficiently analyzing vast and complex data.
We introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs each linked to distinct patients' discharge summaries.
EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries.
arXiv Detail & Related papers (2024-02-25T09:41:50Z) - LongHealth: A Question Answering Benchmark with Long Clinical Documents [36.05587855811346]
We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases.
The benchmark challenges LLMs with 400 multiple-choice questions in three categories: information extraction, negation, and sorting.
We evaluated nine open-source LLMs with context windows of at least 16,000 tokens and also included OpenAI's proprietary, cost-efficient GPT-3.5 Turbo for comparison.
arXiv Detail & Related papers (2024-01-25T19:57:00Z)
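As a note on the readability figures quoted in the AIPatient entry above: the Flesch Reading Ease and Flesch-Kincaid Grade Level are simple closed-form functions of sentence, word, and syllable counts. The sketch below is illustrative only; its vowel-group syllable counter is a rough heuristic and is not the tooling used in that paper (readability scores are typically computed with a library such as `textstat`).

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count runs of consecutive vowels (at least 1 per word)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a text."""
    n_sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    fre = 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)
    fkgl = 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59
    return fre, fkgl

fre, fkgl = readability("The patient was admitted with chest pain. Troponin was negative.")
print(f"Flesch Reading Ease: {fre:.1f}, Flesch-Kincaid Grade: {fkgl:.1f}")
```

Higher Reading Ease (on a 0-100 scale) indicates simpler text, and the grade level approximates the U.S. school grade needed to follow it, so the reported median grade of 5.6 corresponds to text readable at roughly a sixth-grade level.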