Attribute Structuring Improves LLM-Based Evaluation of Clinical Text
Summaries
- URL: http://arxiv.org/abs/2403.01002v1
- Date: Fri, 1 Mar 2024 21:59:03 GMT
- Title: Attribute Structuring Improves LLM-Based Evaluation of Clinical Text
Summaries
- Authors: Zelalem Gero, Chandan Singh, Yiqing Xie, Sheng Zhang, Tristan Naumann,
Jianfeng Gao, Hoifung Poon
- Abstract summary: Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
- Score: 62.32403630651586
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Summarizing clinical text is crucial in health decision-support and clinical
research. Large language models (LLMs) have shown the potential to generate
accurate clinical text summaries, but still struggle with issues regarding
grounding and evaluation, especially in safety-critical domains such as health.
Holistically evaluating text summaries is challenging because they may contain
unsubstantiated information. Here, we explore a general mitigation framework
using Attribute Structuring (AS), which structures the summary evaluation
process. It decomposes the evaluation process into a grounded procedure that
uses an LLM for relatively simple structuring and scoring tasks, rather than
the full task of holistic summary evaluation. Experiments show that AS
consistently improves the correspondence between human annotations and
automated metrics in clinical text summarization. Additionally, AS yields
interpretations in the form of a short text span corresponding to each output,
which enables efficient human auditing, paving the way towards trustworthy
evaluation of clinical information in resource-constrained scenarios. We
release our code, prompts, and an open-source benchmark at
https://github.com/microsoft/attribute-structuring.
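The released code at the GitHub URL above is authoritative; purely as a hedged illustration of the two-stage decomposition the abstract describes (a structuring step, then per-attribute scoring), a minimal sketch might look like the following. The `call_llm` function, the attribute list, and the prompt wording are placeholders invented for this sketch, not the authors' implementation.
```python
# Hypothetical sketch of Attribute Structuring (AS), not the authors' released
# code: (1) a structuring step extracts a short span per clinical attribute
# from each summary, (2) a scoring step compares the two spans per attribute.
from typing import Callable, Dict

# Illustrative attribute list; the real attribute set is task-specific.
ATTRIBUTES = ["chief complaint", "diagnosis", "treatment", "follow-up plan"]

def extract_attribute(call_llm: Callable[[str], str], summary: str, attribute: str) -> str:
    """Structuring: a relatively simple extraction task for the LLM."""
    prompt = (
        f"Quote the short span of the clinical summary below that describes "
        f"the patient's {attribute}. Reply 'absent' if it is not mentioned.\n\n{summary}"
    )
    return call_llm(prompt).strip()

def score_attribute(call_llm: Callable[[str], str], reference_span: str, candidate_span: str) -> int:
    """Scoring: comparing two short spans is simpler than judging whole summaries."""
    prompt = (
        "On a scale of 0 to 10, how well does the candidate span match the "
        f"reference span?\nReference: {reference_span}\nCandidate: {candidate_span}\n"
        "Answer with a single integer."
    )
    return int(call_llm(prompt).strip())

def attribute_structured_scores(
    call_llm: Callable[[str], str], reference_summary: str, candidate_summary: str
) -> Dict[str, dict]:
    """One score per attribute; the extracted spans double as an audit trail."""
    results = {}
    for attribute in ATTRIBUTES:
        ref_span = extract_attribute(call_llm, reference_summary, attribute)
        cand_span = extract_attribute(call_llm, candidate_summary, attribute)
        results[attribute] = {
            "score": score_attribute(call_llm, ref_span, cand_span),
            "evidence": cand_span,  # short span a human can audit quickly
        }
    return results
```
Because each score is tied to a short extracted span, a human auditor can check an individual span rather than reread the whole summary, which is the efficiency benefit the abstract points to.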
Related papers
- FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence [46.71469172542448]
This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts.
It consists of 345 plain language summaries of abstracts generated from three randomized controlled trials (RCTs).
We assess the factuality of critical elements of the RCTs in those summaries, as well as the reported findings concerning them.
arXiv Detail & Related papers (2024-02-18T04:45:01Z)
- Pyclipse, a library for deidentification of free-text clinical notes [0.40329768057075643]
We propose the pyclipse framework to streamline the comparison of deidentification algorithms.
Pyclipse serves as a single interface for running open-source deidentification algorithms on local clinical data.
We find that algorithm performance consistently falls short of the results reported in the original papers, even when evaluated on the same benchmark dataset.
arXiv Detail & Related papers (2023-11-05T19:56:58Z)
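The pyclipse summary above describes a single interface over multiple open-source deidentification tools. The registry sketch below illustrates that design idea in general terms only; `REGISTRY`, `register`, and `run_deidentification` are invented for this sketch and are not pyclipse's actual API.
```python
# Hypothetical illustration of a "single interface" over deidentification
# algorithms; the names here are assumptions, not pyclipse's real API.
import re
from typing import Callable, Dict, List

REGISTRY: Dict[str, Callable[[str], str]] = {}

def register(name: str):
    """Decorator that exposes a deidentification function under a common name."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        REGISTRY[name] = fn
        return fn
    return wrap

@register("regex-phone")
def mask_phone_numbers(note: str) -> str:
    """Toy baseline: masks US-style phone numbers in a clinical note."""
    return re.sub(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b", "[PHONE]", note)

def run_deidentification(algorithm: str, notes: List[str]) -> List[str]:
    """Common entry point: run any registered algorithm on local notes."""
    return [REGISTRY[algorithm](note) for note in notes]
```
A shared entry point like this is what makes it possible to evaluate every algorithm on the same local benchmark, which is how the paper surfaces the gap between reported and reproduced performance.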
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose this instruction-style question about the quality of a generated text into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
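As a hedged sketch of the decompose-then-recompose flow the DecompEval summary describes, the snippet below splits a generated text into sentences, asks a yes/no subquestion per sentence, and recomposes the answers into a score. The `answer_with_plm` function, the prompt wording, and the averaging rule are assumptions, not the paper's implementation.
```python
# Hypothetical sketch of DecompEval-style evaluation: decompose an
# instruction-style quality question into per-sentence subquestions, answer
# each with a pretrained LM, and recompose the answers into a score.
from typing import Callable

def decomposed_evaluation(
    answer_with_plm: Callable[[str], str],  # returns "yes" or "no"
    generated_text: str,
    dimension: str = "coherent",
) -> float:
    sentences = [s.strip() for s in generated_text.split(".") if s.strip()]
    verdicts = []
    for sentence in sentences:
        subquestion = (
            f"Is the sentence '{sentence}' {dimension} with the rest of the "
            f"text below? Answer yes or no.\n\n{generated_text}"
        )
        verdicts.append(answer_with_plm(subquestion).strip().lower() == "yes")
    # Recompose the subanswers as evidence: here, the fraction judged acceptable.
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```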
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
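A minimal sketch of the extract-then-self-verify loop the summary above describes might look like the following, assuming a generic `call_llm` function; the prompts and the abstention rule are placeholders, not the paper's released implementation.
```python
# Hypothetical sketch of self-verification for clinical extraction: the LLM
# extracts a value, is asked for verbatim provenance, and then checks its own
# output; unverified extractions are dropped.
from typing import Callable, Dict, Optional

def extract_with_self_verification(
    call_llm: Callable[[str], str], note: str, field: str
) -> Optional[Dict[str, str]]:
    value = call_llm(
        f"Extract the patient's {field} from this clinical note:\n\n{note}"
    ).strip()
    # Provenance: ask the model to ground its own extraction in the note.
    evidence = call_llm(
        f"Quote the exact sentence in the note that supports "
        f"{field} = '{value}'. Reply 'none' if nothing supports it.\n\n{note}"
    ).strip()
    # Self-check: the model verifies its own output against the evidence.
    verdict = call_llm(
        f"Does the evidence '{evidence}' support {field} = '{value}'? "
        "Answer yes or no."
    ).strip().lower()
    if evidence.lower() == "none" or not verdict.startswith("yes"):
        return None  # prefer abstaining over an ungrounded extraction
    return {"field": field, "value": value, "evidence": evidence}
```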
- A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course Summarization [2.8575516056239576]
Long-form clinical summarization of hospital admissions has real-world significance because of its potential to help both clinicians and patients.
We benchmark faithfulness metrics against fine-grained human annotations for model-generated summaries of a patient's Brief Hospital Course.
arXiv Detail & Related papers (2023-03-07T14:57:06Z)
- Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective to four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z)
- Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)