A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course
Summarization
- URL: http://arxiv.org/abs/2303.03948v1
- Date: Tue, 7 Mar 2023 14:57:06 GMT
- Title: A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course
Summarization
- Authors: Griffin Adams, Jason Zucker, Noémie Elhadad
- Abstract summary: Long-form clinical summarization of hospital admissions has real-world significance because of its potential to help both clinicians and patients.
We benchmark faithfulness metrics against fine-grained human annotations for model-generated summaries of a patient's Brief Hospital Course.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-form clinical summarization of hospital admissions has real-world
significance because of its potential to help both clinicians and patients. The
faithfulness of summaries is critical to their safe usage in clinical settings.
To better understand the limitations of abstractive systems, as well as the
suitability of existing evaluation metrics, we benchmark faithfulness metrics
against fine-grained human annotations for model-generated summaries of a
patient's Brief Hospital Course. We create a corpus of patient hospital
admissions and summaries for a cohort of HIV patients, each with complex
medical histories. Annotators are presented with summaries and source notes,
and asked to categorize manually highlighted summary elements (clinical
entities like conditions and medications as well as actions like "following
up") into one of three categories: ``Incorrect,'' ``Missing,'' and ``Not in
Notes.'' We meta-evaluate a broad set of proposed faithfulness metrics and,
across metrics, explore the importance of domain adaptation (e.g. the impact of
in-domain pre-training and metric fine-tuning), the use of source-summary
alignments, and the effects of distilling a single metric from an ensemble of
pre-existing metrics. Off-the-shelf metrics with no exposure to clinical text
correlate well with human judgments yet rely too heavily on summary extractiveness. As a practical guide
to long-form clinical narrative summarization, we find that most metrics
correlate best to human judgments when provided with one summary sentence at a
time and a minimal set of relevant source context.
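To make that recommended protocol concrete, the sketch below scores each summary sentence against a small set of aligned source sentences and then correlates the metric scores with human faithfulness labels. It is a minimal sketch, not the paper's implementation: the token-overlap alignment, the placeholder metric, and all function names are illustrative assumptions.

```python
# Minimal sketch of sentence-level meta-evaluation: score each summary
# sentence against a few aligned source sentences, then correlate metric
# scores with human faithfulness labels. The alignment heuristic, the
# overlap-based placeholder metric, and all names are illustrative
# assumptions, not the paper's code.

from scipy.stats import spearmanr


def align_source(summary_sent: str, source_sents: list[str], k: int = 3) -> list[str]:
    """Pick the k source sentences with the highest token overlap
    (a crude stand-in for a learned source-summary alignment)."""
    s_tokens = set(summary_sent.lower().split())
    ranked = sorted(
        source_sents,
        key=lambda sent: len(s_tokens & set(sent.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def score_sentence(summary_sent: str, context: list[str]) -> float:
    """Placeholder faithfulness metric: fraction of summary tokens found in
    the aligned context (swap in an entailment- or QA-based scorer here)."""
    s_tokens = set(summary_sent.lower().split())
    c_tokens = set(" ".join(context).lower().split())
    return len(s_tokens & c_tokens) / max(len(s_tokens), 1)


def meta_evaluate(examples: list[dict]) -> float:
    """examples: list of {"summary_sents": [...], "source_sents": [...],
    "human_labels": [...]}, with one faithfulness judgment per summary
    sentence (e.g., 1 = faithful, 0 = contains an error)."""
    metric_scores, human_scores = [], []
    for ex in examples:
        for sent, label in zip(ex["summary_sents"], ex["human_labels"]):
            context = align_source(sent, ex["source_sents"])
            metric_scores.append(score_sentence(sent, context))
            human_scores.append(label)
    correlation, _ = spearmanr(metric_scores, human_scores)
    return correlation
```

Rank correlation (Spearman) is used in this sketch because a metric only needs to order sentences by faithfulness, not reproduce the annotation scale; any benchmarked metric could be dropped into score_sentence.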
Related papers
- Every Component Counts: Rethinking the Measure of Success for Medical Semantic Segmentation in Multi-Instance Segmentation Tasks (arXiv, 2024-10-24)
We present Connected-Component (CC)-Metrics, a novel semantic segmentation evaluation protocol.
We motivate this setup in the common medical scenario of semantic segmentation in a full-body PET/CT.
We show how existing semantic segmentation metrics suffer from a bias towards larger connected components.
- FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (arXiv, 2024-03-04)
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE).
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de facto benchmark for factuality evaluation.
- Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries (arXiv, 2024-03-01)
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
- Extrinsically-Focused Evaluation of Omissions in Medical Summarization (arXiv, 2023-11-14)
We propose MED-OMIT, a new omission benchmark for medical summarization.
Given a doctor-patient conversation and a generated summary, MED-OMIT categorizes the chat into a set of facts and identifies which are omitted from the summary.
We evaluate MED-OMIT on a publicly released dataset of patient-doctor conversations and find that MED-OMIT captures omissions better than alternative metrics.
- Making the Most Out of the Limited Context Length: Predictive Power Varies with Clinical Note Type and Note Section (arXiv, 2023-07-13)
We propose a framework to analyze the sections with high predictive power.
Using MIMIC-III, we show that: 1) predictive power distribution is different between nursing notes and discharge notes and 2) combining different types of notes could improve performance when the context length is large.
- An Investigation of Evaluation Metrics for Automated Medical Note Generation (arXiv, 2023-05-27)
We study evaluation methods and metrics for the automatic generation of clinical notes from medical conversations.
To study the correlation between the automatic metrics and manual judgments, we evaluate automatic notes/summaries by comparing the system and reference facts.
- Generating medically-accurate summaries of patient-provider dialogue: A multi-stage approach using large language models (arXiv, 2023-05-10)
An effective summary is required to be coherent and to accurately capture all the medically relevant information in the dialogue.
This paper tackles the problem of medical conversation summarization by discretizing the task into several smaller dialogue-understanding tasks.
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation (arXiv, 2022-04-01)
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BERTScore; a minimal sketch of such a character-level baseline appears after this list.
- Self-supervised Answer Retrieval on Clinical Notes (arXiv, 2021-08-02)
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
- Towards Clinical Encounter Summarization: Learning to Compose Discharge Summaries from Prior Notes (arXiv, 2021-04-27)
This paper introduces the task of generating discharge summaries for a clinical encounter.
We introduce two new measures, faithfulness and hallucination rate, for evaluation.
Results across seven medical sections and five models show that a summarization architecture that supports traceability yields promising results.
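As a concrete illustration of the character-based baseline from the consultation-note study above, the sketch below computes a normalized Levenshtein similarity between a generated note and a clinician-written reference. The function names and the normalization choice are illustrative assumptions, not code from any of the papers listed.

```python
# Minimal sketch: character-level Levenshtein similarity between a generated
# clinical note and a reference note. The normalization (1 - distance / max
# length) and the names are illustrative choices, not from the cited paper.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over characters."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(generated: str, reference: str) -> float:
    """Normalize to [0, 1], where 1 means the notes are identical."""
    if not generated and not reference:
        return 1.0
    dist = levenshtein(generated, reference)
    return 1.0 - dist / max(len(generated), len(reference))


# Example usage with toy notes:
print(levenshtein_similarity(
    "Patient admitted with pneumonia, treated with IV antibiotics.",
    "Patient was admitted for pneumonia and treated with IV antibiotics.",
))
```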
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.