A Benchmark of Domain-Adapted Large Language Models for Generating Brief
Hospital Course Summaries
- URL: http://arxiv.org/abs/2403.05720v1
- Date: Fri, 8 Mar 2024 23:17:55 GMT
- Title: A Benchmark of Domain-Adapted Large Language Models for Generating Brief
Hospital Course Summaries
- Authors: Asad Aali, Dave Van Veen, Yamin Ishraq Arefeen, Jason Hom, Christian
Bluethgen, Eduardo Pontes Reis, Sergios Gatidis, Namuun Clifford, Joseph
Daws, Arash S. Tehrani, Jangwon Kim, Akshay S. Chaudhari
- Abstract summary: Large language models (LLMs) exhibit remarkable capabilities in automating real-world tasks, but their capabilities for healthcare applications have not been demonstrated.
We introduce a novel benchmark consisting of a pre-processed dataset extracted from MIMIC-IV notes.
We assess the performance of two general-purpose LLMs and three healthcare-adapted LLMs to improve brief hospital course (BHC) synthesis from clinical notes.
- Score: 4.201332098927781
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Brief hospital course (BHC) summaries are common clinical documents generated
by summarizing clinical notes. While large language models (LLMs) exhibit
remarkable capabilities in automating real-world tasks, their capabilities for
healthcare applications such as BHC synthesis have not been shown. To enable
the adaptation of LLMs for BHC synthesis, we introduce a novel benchmark
consisting of a pre-processed dataset extracted from MIMIC-IV notes,
comprising pairs of clinical notes and brief hospital courses (BHCs). We assess
the performance of two general-purpose LLMs and three healthcare-adapted LLMs
to improve BHC synthesis from clinical notes. Using clinical notes as input for
generating BHCs, we apply prompting-based (using in-context learning) and
fine-tuning-based adaptation strategies to three open-source LLMs
(Clinical-T5-Large, Llama2-13B, FLAN-UL2) and two proprietary LLMs (GPT-3.5,
GPT-4). We quantitatively evaluate the performance of these LLMs across varying
context-length inputs using conventional natural language similarity metrics.
We further perform a qualitative study where five diverse clinicians blindly
compare clinician-written BHCs and two LLM-generated BHCs for 30 samples across
metrics of comprehensiveness, conciseness, factual correctness, and fluency.
Overall, we present a new benchmark and pre-processed dataset for using LLMs in
BHC synthesis from clinical notes. We observe high-quality summarization
performance for both in-context proprietary and fine-tuned open-source LLMs
using both quantitative metrics and a qualitative clinical reader study. We
propose our work as a benchmark to motivate future work to adapt and assess
the performance of LLMs in BHC synthesis.
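The abstract refers to "conventional natural language similarity metrics" without naming them here. As an illustration only, the sketch below computes a ROUGE-L style F1 score (longest common subsequence overlap) between a reference BHC and a generated one. The choice of ROUGE-L, the whitespace tokenization, and the function names `lcs_length` and `rouge_l_f1` are assumptions made for this example, not details taken from the paper, and the sample strings are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L style F1 on lowercased, whitespace-split tokens (assumed tokenization)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented example strings, not drawn from MIMIC-IV.
    reference = "patient admitted with pneumonia treated with antibiotics"
    candidate = "patient treated with antibiotics"
    print(round(rouge_l_f1(reference, candidate), 3))
```

A score of this kind rewards word-order-preserving overlap with the clinician-written text, which is one reason the authors pair such metrics with a clinical reader study.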
Related papers
- Adapting Open-Source Large Language Models for Cost-Effective, Expert-Level Clinical Note Generation with On-Policy Reinforcement Learning [19.08691249610632]
This study presents a comprehensive domain- and task-specific adaptation process for the open-source LLaMA-2 13 billion parameter model.
We introduce a new approach, DistillDirect, for performing on-policy reinforcement learning with Gemini 1.0 Pro as the teacher model.
Our model, LLaMA-Clinic, can generate clinical notes comparable in quality to those authored by physicians.
arXiv Detail & Related papers (2024-04-25T15:34:53Z)
- CLUE: A Clinical Language Understanding Evaluation for LLMs [2.3814275542331385]
Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes.
Assessing the models' suitability for this sensitive application area is of utmost importance.
We present the Clinical Language Understanding Evaluation (CLUE), a benchmark tailored to evaluate LLMs on clinical tasks.
arXiv Detail & Related papers (2024-04-05T12:51:37Z)
- Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [62.32403630651586]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
- From RAGs to riches: Using large language models to write documents for clinical trials [0.0]
Large language models (LLMs) offer the potential to rapidly generate first versions of clinical trial documents.
We report an evaluation of LLMs in generating parts of one such document, clinical trial protocols.
To improve performance, we used retrieval-augmented generation (RAG) to prompt an LLM with accurate up-to-date information.
arXiv Detail & Related papers (2024-02-26T08:59:05Z)
- EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries [9.031182965159976]
Large Language Models (LLMs) show promise in efficiently analyzing vast and complex data.
We introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs each linked to distinct patients' discharge summaries.
EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries.
arXiv Detail & Related papers (2024-02-25T09:41:50Z)
- Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization [8.456700096020601]
Large language models (LLMs) have shown promise in natural language processing (NLP), but their effectiveness on a diverse range of clinical summarization tasks remains unproven.
In this study, we apply adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks.
A clinical reader study with ten physicians evaluates summary completeness, correctness, and conciseness; in a majority of cases, summaries from our best adapted LLMs are either equivalent (45%) or superior (36%) compared to summaries from medical experts.
arXiv Detail & Related papers (2023-09-14T05:15:01Z)
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
- Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching [49.78442796596806]
We propose an innovative privacy-aware data augmentation approach for patient-trial matching (LLM-PTM).
Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%.
arXiv Detail & Related papers (2023-03-24T03:14:00Z)
- Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system, merging the results of a deep learning model as well as manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
- Zero-Shot Cross-Lingual Summarization via Large Language Models [108.30673793281987]
Cross-lingual summarization (CLS) generates a summary in a different target language.
The recent emergence of Large Language Models (LLMs) has attracted wide attention from the computational linguistics community.
In this report, we empirically use various prompts to guide LLMs to perform zero-shot CLS from different paradigms.
arXiv Detail & Related papers (2023-02-28T01:27:37Z)
- Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.