A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course
Summarization
- URL: http://arxiv.org/abs/2303.03948v1
- Date: Tue, 7 Mar 2023 14:57:06 GMT
- Title: A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course
Summarization
- Authors: Griffin Adams, Jason Zucker, Noémie Elhadad
- Abstract summary: Long-form clinical summarization of hospital admissions has real-world significance because of its potential to help both clinicians and patients.
We benchmark faithfulness metrics against fine-grained human annotations for model-generated summaries of a patient's Brief Hospital Course.
- Score: 2.8575516056239576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-form clinical summarization of hospital admissions has real-world
significance because of its potential to help both clinicians and patients. The
faithfulness of summaries is critical to their safe usage in clinical settings.
To better understand the limitations of abstractive systems, as well as the
suitability of existing evaluation metrics, we benchmark faithfulness metrics
against fine-grained human annotations for model-generated summaries of a
patient's Brief Hospital Course. We create a corpus of patient hospital
admissions and summaries for a cohort of HIV patients, each with complex
medical histories. Annotators are presented with summaries and source notes,
and asked to categorize manually highlighted summary elements (clinical
entities like conditions and medications as well as actions like "following
up") into one of three categories: ``Incorrect,'' ``Missing,'' and ``Not in
Notes.'' We meta-evaluate a broad set of proposed faithfulness metrics and,
across metrics, explore the importance of domain adaptation (e.g. the impact of
in-domain pre-training and metric fine-tuning), the use of source-summary
alignments, and the effects of distilling a single metric from an ensemble of
pre-existing metrics. Off-the-shelf metrics with no exposure to clinical text
correlate well yet overly rely on summary extractiveness. As a practical guide
to long-form clinical narrative summarization, we find that most metrics
correlate best to human judgments when provided with one summary sentence at a
time and a minimal set of relevant source context.
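As a rough illustration of the sentence-level meta-evaluation setup described above, the sketch below correlates an automatic faithfulness metric's scores with human judgments derived from the annotation categories. The metric callable, the data schema, and the field names are all hypothetical; the actual study uses fine-grained, entity-level annotations and a broad battery of metrics.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

from scipy.stats import spearmanr  # rank correlation between metric and humans


@dataclass
class AnnotatedSentence:
    """One summary sentence with its aligned source context and human labels.

    Field names are illustrative only; the paper's annotators mark highlighted
    clinical entities/actions as "Incorrect", "Missing", or "Not in Notes".
    """
    summary_sentence: str
    aligned_source: list[str]   # minimal set of relevant source sentences
    n_highlighted: int          # highlighted entities/actions in this sentence
    n_unsupported: int          # those judged "Incorrect" or "Not in Notes"

    @property
    def human_faithfulness(self) -> float:
        if self.n_highlighted == 0:
            return 1.0
        return 1.0 - self.n_unsupported / self.n_highlighted


def meta_evaluate(metric: Callable[[str, str], float],
                  data: Sequence[AnnotatedSentence]) -> float:
    """Spearman correlation between metric scores and human faithfulness.

    The metric is called one summary sentence at a time with only its aligned
    source context, the configuration the abstract reports correlates best.
    """
    metric_scores = [metric(" ".join(ex.aligned_source), ex.summary_sentence)
                     for ex in data]
    human_scores = [ex.human_faithfulness for ex in data]
    rho, _ = spearmanr(metric_scores, human_scores)
    return rho
```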
Related papers
- Every Component Counts: Rethinking the Measure of Success for Medical Semantic Segmentation in Multi-Instance Segmentation Tasks [60.80828925396154]
We present Connected-Component(CC)-Metrics, a novel semantic segmentation evaluation protocol.
We motivate this setup in the common medical scenario of semantic segmentation in a full-body PET/CT.
We show how existing semantic segmentation metrics suffer from a bias towards larger connected components.
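A minimal sketch of the connected-component idea, assuming binary masks and scipy: it averages a Dice score over ground-truth components so small structures count as much as large ones. This only illustrates the general approach, not the paper's exact matching protocol.

```python
import numpy as np
from scipy import ndimage  # connected-component labelling


def per_component_dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average Dice over every connected component of the ground truth.

    `pred` and `gt` are binary arrays; small components weigh as much as
    large ones, which is the bias CC-style protocols are meant to remove.
    """
    gt_labels, n_components = ndimage.label(gt)
    if n_components == 0:
        return float(pred.sum() == 0)
    pred_labels, _ = ndimage.label(pred)
    scores = []
    for comp_id in range(1, n_components + 1):
        comp = gt_labels == comp_id
        # predicted components touching this ground-truth component
        touching = np.unique(pred_labels[comp])
        touching = touching[touching != 0]
        matched = np.isin(pred_labels, touching)
        inter = np.logical_and(matched, comp).sum()
        scores.append(2.0 * inter / (matched.sum() + comp.sum()))
    return float(np.mean(scores))
```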
arXiv Detail & Related papers (2024-10-24T12:26:05Z) - uMedSum: A Unified Framework for Advancing Medical Abstractive Summarization [23.173826980480936]
Current methods often sacrifice key information for faithfulness or introduce confabulations when prioritizing informativeness.
This paper presents a benchmark of six advanced abstractive summarization methods across three diverse datasets using five standardized metrics.
We propose uMedSum, a modular hybrid summarization framework that introduces novel approaches for sequential confabulation removal followed by key missing information addition.
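The sequential two-stage structure described above can be sketched as a thin pipeline; the two callables are placeholders for the model-based modules from the paper, whose prompts and models are not reproduced here.

```python
from typing import Callable


def umedsum_style_refine(source: str,
                         draft_summary: str,
                         remove_confabulations: Callable[[str, str], str],
                         add_missing_info: Callable[[str, str], str]) -> str:
    """Sequential refinement: strip unsupported content, then restore key facts.

    Both steps receive the source so each decision can be grounded in it.
    """
    cleaned = remove_confabulations(source, draft_summary)
    return add_missing_info(source, cleaned)
```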
arXiv Detail & Related papers (2024-08-22T03:08:49Z) - FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE)
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation.
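A hedged sketch of claim-level NLI scoring in the spirit of FENICE, using an off-the-shelf MNLI model from Hugging Face; the actual metric relies on its own claim-extraction and alignment components, which are not reproduced here, and the claims are assumed to be provided.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # any MNLI-style checkpoint works for this sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()
ENT_IDX = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]


def claim_support(source_sentences: list[str], claim: str) -> float:
    """Highest entailment probability of one summary claim against any source sentence."""
    best = 0.0
    for premise in source_sentences:
        inputs = tokenizer(premise, claim, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1)[0]
        best = max(best, probs[ENT_IDX].item())
    return best


def summary_faithfulness(source_sentences: list[str], claims: list[str]) -> float:
    """Mean per-claim support; in FENICE the claims come from a claim-extraction model."""
    if not claims:
        return 1.0
    return sum(claim_support(source_sentences, c) for c in claims) / len(claims)
```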
arXiv Detail & Related papers (2024-03-04T17:57:18Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text
Summaries [62.32403630651586]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
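The sketch below shows the basic shape of attribute-structured evaluation: score one narrow attribute at a time instead of asking for a single overall grade. The attribute list and the grading callable are placeholders, not the prompts or attributes from the paper.

```python
from statistics import mean
from typing import Callable

# Illustrative attributes only; the AS framework defines its own set.
ATTRIBUTES = [
    "primary diagnosis",
    "key procedures performed",
    "medication changes",
    "follow-up instructions",
]


def structured_scores(judge: Callable[[str, str, str], float],
                      source: str,
                      summary: str) -> dict[str, float]:
    """Per-attribute scores from a grader (LLM prompt, NLI model, or human)."""
    return {attr: judge(attr, source, summary) for attr in ATTRIBUTES}


def overall_score(judge: Callable[[str, str, str], float],
                  source: str,
                  summary: str) -> float:
    """Aggregate by averaging the attribute-level scores."""
    return mean(structured_scores(judge, source, summary).values())
```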
arXiv Detail & Related papers (2024-03-01T21:59:03Z) - Making the Most Out of the Limited Context Length: Predictive Power
Varies with Clinical Note Type and Note Section [70.37720062263176]
We propose a framework to analyze the sections with high predictive power.
Using MIMIC-III, we show that: 1) the distribution of predictive power differs between nursing notes and discharge notes, and 2) combining different types of notes could improve performance when the context length is large.
arXiv Detail & Related papers (2023-07-13T20:04:05Z) - An Investigation of Evaluation Metrics for Automated Medical Note
Generation [5.094623170336122]
We study evaluation methods and metrics for the automatic generation of clinical notes from medical conversations.
To study the correlation between the automatic metrics and manual judgments, we evaluate automatic notes/summaries by comparing the system and reference facts.
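One simple way to compare system and reference facts, as the entry above describes, is fact-level precision/recall/F1; the sketch below treats facts as normalized strings, which is an assumption, not the paper's exact procedure.

```python
def fact_prf(system_facts: set[str],
             reference_facts: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 over exact-match facts.

    Real evaluations typically use softer matching (synonyms, entity linking);
    exact string matching is only a stand-in here.
    """
    true_positives = len(system_facts & reference_facts)
    precision = true_positives / len(system_facts) if system_facts else 0.0
    recall = true_positives / len(reference_facts) if reference_facts else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```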
arXiv Detail & Related papers (2023-05-27T04:34:58Z) - Generating medically-accurate summaries of patient-provider dialogue: A
multi-stage approach using large language models [6.252236971703546]
An effective summary is required to be coherent and accurately capture all the medically relevant information in the dialogue.
This paper tackles the problem of medical conversation summarization by discretizing the task into several smaller dialogue-understanding tasks.
arXiv Detail & Related papers (2023-05-10T08:48:53Z) - Human Evaluation and Correlation with Automatic Metrics in Consultation
Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BERTScore.
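A small sketch of the kind of comparison reported above: a normalized character-level Levenshtein similarity, rank-correlated against clinician error counts; the same call pattern would apply to BERTScore or any other metric. The data shapes are assumptions.

```python
from scipy.stats import spearmanr


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(hyp: str, ref: str) -> float:
    """Normalize to [0, 1] so it is comparable with model-based scores."""
    if not hyp and not ref:
        return 1.0
    return 1.0 - levenshtein(hyp, ref) / max(len(hyp), len(ref))


def correlation_with_humans(hyps: list[str], refs: list[str],
                            human_error_counts: list[int]) -> float:
    """Spearman correlation between metric scores and per-note error counts."""
    scores = [levenshtein_similarity(h, r) for h, r in zip(hyps, refs)]
    rho, _ = spearmanr(scores, human_error_counts)
    return rho
```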
arXiv Detail & Related papers (2022-04-01T14:04:16Z) - Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
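The bi-encoder variant mentioned above can be sketched with sentence-transformers as below; this shows generic passage matching only, not CAPR's rule-based self-supervision objective or training setup. The model name is just a common public checkpoint.

```python
from sentence_transformers import SentenceTransformer, util

# A generic public checkpoint; CAPR trains its own domain-adapted encoders.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def rank_passages(query: str, passages: list[str],
                  top_k: int = 5) -> list[tuple[str, float]]:
    """Score clinical passages against a query with a bi-encoder (cosine similarity)."""
    query_emb = encoder.encode(query, convert_to_tensor=True)
    passage_embs = encoder.encode(passages, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, passage_embs)[0]
    ranked = sorted(zip(passages, scores.tolist()), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]
```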
arXiv Detail & Related papers (2021-08-02T10:42:52Z) - Towards Clinical Encounter Summarization: Learning to Compose Discharge
Summaries from Prior Notes [15.689048077818324]
This paper introduces the task of generating discharge summaries for a clinical encounter.
We introduce two new measures, faithfulness and hallucination rate, for evaluation.
Results across seven medical sections and five models show that a summarization architecture that supports traceability yields promising results.
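A toy stand-in for the kind of faithfulness/hallucination measures this entry mentions: the share of summary tokens that never appear in any prior note. The real measures in the paper are defined over generated content and its provenance; plain token overlap is only an assumption used to keep the sketch self-contained.

```python
import re


def _content_tokens(text: str) -> set[str]:
    """Crude lowercase token set; a real system would use clinical entity extraction."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def hallucination_rate(summary: str, source_notes: list[str]) -> float:
    """Fraction of summary tokens with no support anywhere in the prior notes."""
    source_vocab: set[str] = set()
    for note in source_notes:
        source_vocab |= _content_tokens(note)
    summary_tokens = _content_tokens(summary)
    if not summary_tokens:
        return 0.0
    unsupported = [tok for tok in summary_tokens if tok not in source_vocab]
    return len(unsupported) / len(summary_tokens)
```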
arXiv Detail & Related papers (2021-04-27T22:45:54Z)