SPEER: Sentence-Level Planning of Long Clinical Summaries via Embedded
Entity Retrieval
- URL: http://arxiv.org/abs/2401.02369v1
- Date: Thu, 4 Jan 2024 17:23:44 GMT
- Title: SPEER: Sentence-Level Planning of Long Clinical Summaries via Embedded
Entity Retrieval
- Authors: Griffin Adams, Jason Zucker, Noémie Elhadad
- Abstract summary: Clinicians must write a lengthy summary each time a patient is discharged from the hospital.
Identifying and covering salient entities is vital for the summary to be clinically useful.
We fine-tune open-source LLMs on the task and find that they generate incomplete and unfaithful summaries.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clinicians must write a lengthy summary each time a patient is discharged from
the hospital. This task is time-consuming due to the sheer number of unique
clinical concepts covered in the admission. Identifying and covering salient
entities is vital for the summary to be clinically useful. We fine-tune
open-source LLMs (Mistral-7B-Instruct and Zephyr-7B-β) on the task and
find that they generate incomplete and unfaithful summaries. To increase entity
coverage, we train a smaller, encoder-only model to predict salient entities,
which are treated as content-plans to guide the LLM. To encourage the LLM to
focus on specific mentions in the source notes, we propose SPEER:
Sentence-level Planning via Embedded Entity Retrieval. Specifically, we mark
each salient entity span with special "{{ }}" boundary tags and instruct the
LLM to retrieve marked spans before generating each sentence. Sentence-level
planning acts as a form of state tracking in that the model is explicitly
recording the entities it uses. We fine-tune Mistral and Zephyr variants on a
large-scale, diverse dataset of ~167k in-patient hospital admissions and
evaluate on 3 datasets. SPEER shows gains in both coverage and faithfulness
metrics over non-guided and guided baselines.
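The core mechanism described in the abstract, marking salient entity spans in the source notes with "{{ }}" boundary tags and having the model retrieve the marked spans before writing each sentence, can be illustrated with a minimal sketch. The function names and the naive substring matching below are illustrative assumptions, not the paper's actual implementation, which selects spans with a trained encoder-only salience model.

```python
import re

def mark_salient_entities(note: str, entities: list[str]) -> str:
    """Wrap each salient entity mention in {{ }} boundary tags, as the
    abstract describes. Matching here is naive case-insensitive substring
    search; the paper's span selection is learned, not rule-based."""
    marked = note
    for ent in entities:
        pattern = re.compile(re.escape(ent), re.IGNORECASE)
        marked = pattern.sub(lambda m: "{{" + m.group(0) + "}}", marked)
    return marked

def retrieve_marked_spans(marked_note: str) -> list[str]:
    """Pull back the {{ }}-tagged spans that the LLM is instructed to
    retrieve before generating each summary sentence."""
    return re.findall(r"\{\{(.*?)\}\}", marked_note)

note = "Patient admitted with acute pancreatitis; started on IV fluids."
marked = mark_salient_entities(note, ["acute pancreatitis", "IV fluids"])
print(marked)
# Patient admitted with {{acute pancreatitis}}; started on {{IV fluids}}.
print(retrieve_marked_spans(marked))
# ['acute pancreatitis', 'IV fluids']
```

Echoing the retrieved spans before each sentence is what the abstract calls state tracking: the model's explicit record of which entities it has already covered.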
Related papers
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - FineSurE: Fine-grained Summarization Evaluation using LLMs [22.62504593575933]
FineSurE is a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs)
It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment.
arXiv Detail & Related papers (2024-07-01T02:20:28Z) - Generating Faithful and Complete Hospital-Course Summaries from the Electronic Health Record [3.6513957125331555]
An unintended consequence of the increased documentation burden has been reduced face-time with patients.
We propose and evaluate automated solutions for generating a summary of a patient's hospital admissions.
arXiv Detail & Related papers (2024-04-01T15:47:21Z) - Developing Healthcare Language Model Embedding Spaces [0.20971479389679337]
Pre-trained Large Language Models (LLMs) often struggle on out-of-domain datasets like healthcare focused text.
Three methods are assessed: traditional masked language modeling, Deep Contrastive Learning for Unsupervised Textual Representations (DeCLUTR) and a novel pre-training objective utilizing metadata categories from the healthcare settings.
Contrastively trained models outperform other approaches on the classification tasks, delivering strong performance from limited labeled data and with fewer model parameter updates required.
arXiv Detail & Related papers (2024-03-28T19:31:32Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text
Summaries [62.32403630651586]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z) - Extrinsically-Focused Evaluation of Omissions in Medical Summarization [10.02553223045504]
We propose MED-OMIT, a new omission benchmark for medical summarization.
Given a doctor-patient conversation and a generated summary, MED-OMIT categorizes the chat into a set of facts and identifies which are omitted from the summary.
We evaluate MED-OMIT on a publicly-released dataset of patient-doctor conversations and find that MED-OMIT captures omissions better than alternative metrics.
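The omission check MED-OMIT performs, decomposing the source into facts and flagging those absent from the summary, can be sketched as follows. This toy version uses exact substring matching as a stand-in; the actual benchmark judges fact coverage semantically with an LLM, and the function name is an assumption.

```python
def find_omitted_facts(source_facts: list[str], summary: str) -> list[str]:
    """Toy omission check in the spirit of MED-OMIT: a fact counts as
    covered only if its exact text appears in the summary. The real
    benchmark extracts facts from the conversation automatically and
    scores coverage semantically rather than by substring match."""
    summary_lower = summary.lower()
    return [fact for fact in source_facts if fact.lower() not in summary_lower]

facts = ["started on metformin", "allergic to penicillin"]
summary = "The patient was started on metformin."
print(find_omitted_facts(facts, summary))
# ['allergic to penicillin']
```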
arXiv Detail & Related papers (2023-11-14T16:46:15Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large
Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with
Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
Evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
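The self-verification loop, prompting the same model to cite provenance for each item it extracted and to flag unsupported ones, can be sketched as below. The `llm` callable, prompt wording, and function name are all hypothetical stand-ins, not the paper's actual prompts.

```python
def self_verify_extraction(llm, note: str, extraction: str) -> str:
    """Sketch of self-verification: ask the model that produced an
    extraction to quote supporting evidence from the source note for
    each item, marking items without support as UNSUPPORTED.
    `llm` is a hypothetical callable mapping a prompt string to a
    completion string."""
    prompt = (
        "Source note:\n" + note + "\n\n"
        "Extracted items:\n" + extraction + "\n\n"
        "For each extracted item, quote the exact sentence from the "
        "source note that supports it, or reply UNSUPPORTED."
    )
    return llm(prompt)
```

In practice the returned provenance can be checked against the source text, and UNSUPPORTED items dropped, which is how self-verification improves both accuracy and interpretability.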
arXiv Detail & Related papers (2023-05-30T22:05:11Z) - Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z) - CLIP: A Dataset for Extracting Action Items for Physicians from Hospital
Discharge Notes [17.107315598110183]
We create a dataset of clinical action items annotated over MIMIC-III, the largest publicly available dataset of real clinical notes.
This dataset, which we call CLIP, is annotated by physicians and covers documents representing 100K sentences.
We describe the task of extracting the action items from these documents as multi-aspect extractive summarization, with each aspect representing a type of action to be taken.
arXiv Detail & Related papers (2021-06-04T14:49:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.