Towards objectively evaluating the quality of generated medical summaries
- URL: http://arxiv.org/abs/2104.04412v1
- Date: Fri, 9 Apr 2021 15:02:56 GMT
- Title: Towards objectively evaluating the quality of generated medical summaries
- Authors: Francesco Moramarco, Damir Juric, Aleksandar Savkov, Ehud Reiter
- Abstract summary: We ask evaluators to count facts, computing precision, recall, f-score, and accuracy from the raw counts.
We apply this to the task of medical report summarisation, where measuring objective quality and accuracy is of paramount importance.
- Score: 70.09940409175998
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a method for evaluating the quality of generated text by asking
evaluators to count facts, and computing precision, recall, f-score, and
accuracy from the raw counts. We believe this approach leads to a more
objective and easier-to-reproduce evaluation. We apply this to the task of
medical report summarisation, where measuring objective quality and accuracy is
of paramount importance.
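To make the proposed evaluation concrete, the sketch below computes precision, recall, F-score, and accuracy from evaluator fact counts. The three count categories (correct, incorrect, omitted facts) and the particular accuracy definition are illustrative assumptions and may differ from the paper's exact counting protocol.

```python
# Minimal sketch: deriving precision, recall, F-score, and accuracy from
# evaluator fact counts. The count categories and the accuracy definition
# are assumptions for illustration, not taken from the paper.

from dataclasses import dataclass


@dataclass
class FactCounts:
    correct: int    # facts in the summary supported by the source (true positives)
    incorrect: int  # facts in the summary that are wrong or unsupported (false positives)
    omitted: int    # facts in the source missing from the summary (false negatives)


def fact_metrics(c: FactCounts) -> dict:
    tp, fp, fn = c.correct, c.incorrect, c.omitted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    # One possible reading of accuracy: correct facts over all counted facts.
    accuracy = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}


if __name__ == "__main__":
    # Example: an evaluator counted 18 correct, 2 incorrect, and 4 omitted facts.
    print(fact_metrics(FactCounts(correct=18, incorrect=2, omitted=4)))
```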
Related papers
- A Critical Look at Meta-evaluating Summarisation Evaluation Metrics [11.541368732416506]
We argue that the time is ripe to build more diverse benchmarks that enable the development of more robust evaluation metrics.
We call for research focusing on user-centric quality dimensions that consider the generated summary's communicative goal.
arXiv Detail & Related papers (2024-09-29T01:30:13Z)
- DeepScore: A Comprehensive Approach to Measuring Quality in AI-Generated Clinical Documentation [0.0]
This paper presents an overview of DeepScribe's methodologies for assessing and managing note quality.
These methodologies aim to enhance the quality of patient care documentation through accountability and continuous improvement.
arXiv Detail & Related papers (2024-09-10T23:06:48Z)
- Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [62.32403630651586]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
- Revisiting Automatic Question Summarization Evaluation in the Biomedical Domain [45.78632945525459]
We conduct human evaluations of summarization quality from four different aspects of a biomedical question summarization task.
Based on human judgments, we identify different noteworthy features for current automatic metrics and summarization systems.
arXiv Detail & Related papers (2023-03-18T04:28:01Z)
- FactReranker: Fact-guided Reranker for Faithful Radiology Report Summarization [42.7555185736215]
We propose FactReranker, which learns to choose the best summary from all candidates based on their estimated factual consistency score.
We decompose the fact-guided reranker into the factual knowledge graph generation and the factual scorer.
Experimental results on two benchmark datasets demonstrate the superiority of our method in generating summaries with higher factual consistency scores.
arXiv Detail & Related papers (2023-03-15T02:51:57Z)
- Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation [58.54483567073125]
We propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists.
We observed good levels of inter-annotator agreement in a first evaluation study using the protocol.
arXiv Detail & Related papers (2022-11-17T10:54:28Z)
- Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries [59.27273928454995]
Current pre-trained models applied to summarization are prone to factual inconsistencies which misrepresent the source text or introduce extraneous information.
We create a crowdsourcing evaluation framework for factual consistency using the rating-based Likert scale and ranking-based Best-Worst Scaling protocols.
We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design.
arXiv Detail & Related papers (2021-09-19T19:05:00Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.