An Investigation of Evaluation Metrics for Automated Medical Note
Generation
- URL: http://arxiv.org/abs/2305.17364v1
- Date: Sat, 27 May 2023 04:34:58 GMT
- Title: An Investigation of Evaluation Metrics for Automated Medical Note
Generation
- Authors: Asma Ben Abacha and Wen-wai Yim and George Michalopoulos and Thomas
Lin
- Abstract summary: We study evaluation methods and metrics for the automatic generation of clinical notes from medical conversations.
To study the correlation between the automatic metrics and manual judgments, we evaluate automatic notes/summaries by comparing the system and reference facts.
- Score: 5.094623170336122
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies on automatic note generation have shown that doctors can save
significant amounts of time when using automatic clinical note generation
(Knoll et al., 2022). Summarization models have been used for this task to
generate clinical notes as summaries of doctor-patient conversations (Krishna
et al., 2021; Cai et al., 2022). However, assessing which model would best
serve clinicians in their daily practice is still a challenging task due to the
large set of possible correct summaries, and the potential limitations of
automatic evaluation metrics. In this paper, we study evaluation methods and
metrics for the automatic generation of clinical notes from medical
conversations. In particular, we propose new task-specific metrics and we
compare them to SOTA evaluation metrics in text summarization and generation,
including: (i) knowledge-graph embedding-based metrics, (ii) customized
model-based metrics, (iii) domain-adapted/fine-tuned metrics, and (iv) ensemble
metrics. To study the correlation between the automatic metrics and manual
judgments, we evaluate automatic notes/summaries by comparing the system and
reference facts and computing the factual correctness, and the hallucination
and omission rates for critical medical facts. This study relied on seven
datasets manually annotated by domain experts. Our experiments show that
automatic evaluation metrics can behave substantially differently across
different types of clinical note datasets. However, the results highlight one
stable subset of metrics as the most correlated with human judgments when the
different evaluation criteria are aggregated appropriately.
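
As a rough illustration of the fact-level evaluation described in the abstract, the sketch below assumes that medical facts have already been extracted from the system and reference notes as normalized strings (in the study itself this annotation was done by domain experts) and computes factual correctness together with hallucination and omission rates. It is a hypothetical sketch, not the authors' implementation.

```python
# Hypothetical sketch of fact-level scoring; assumes facts are already
# extracted from the notes and normalized into comparable strings.
from dataclasses import dataclass


@dataclass
class FactScores:
    factual_correctness: float  # fraction of reference facts covered by the system note
    hallucination_rate: float   # fraction of system facts with no reference support
    omission_rate: float        # fraction of reference facts absent from the system note


def score_facts(system_facts: set[str], reference_facts: set[str]) -> FactScores:
    matched = system_facts & reference_facts
    correctness = len(matched) / len(reference_facts) if reference_facts else 1.0
    hallucination = (len(system_facts - reference_facts) / len(system_facts)
                     if system_facts else 0.0)
    omission = (len(reference_facts - system_facts) / len(reference_facts)
                if reference_facts else 0.0)
    return FactScores(correctness, hallucination, omission)


# Toy, manually listed facts for illustration only.
ref_facts = {"patient reports chest pain", "on lisinopril 10 mg", "no known allergies"}
sys_facts = {"patient reports chest pain", "on lisinopril 20 mg"}
print(score_facts(sys_facts, ref_facts))
```

With exact matching as above, factual correctness is simply one minus the omission rate; the metrics studied in the paper differ precisely in how they soften this matching step.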
Related papers
- Improving Clinical Note Generation from Complex Doctor-Patient Conversation [20.2157016701399]
We present three key contributions to the field of clinical note generation using large language models (LLMs).
First, we introduce CliniKnote, a dataset consisting of 1,200 complex doctor-patient conversations paired with their full clinical notes.
Second, we propose K-SOAP, which enhances traditional SOAP (Subjective, Objective, Assessment, and Plan) notes by adding a keyword section at the top, allowing for quick identification of essential information.
Third, we develop an automatic pipeline to generate K-SOAP notes from doctor-patient conversations and benchmark various modern LLMs.
arXiv Detail & Related papers (2024-08-26T18:39:31Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- Revisiting Automatic Question Summarization Evaluation in the Biomedical Domain [45.78632945525459]
We conduct human evaluations of summarization quality from four different aspects of a biomedical question summarization task.
Based on human judgments, we identify different noteworthy features for current automatic metrics and summarization systems.
arXiv Detail & Related papers (2023-03-18T04:28:01Z)
- A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course Summarization [2.8575516056239576]
Long-form clinical summarization of hospital admissions has real-world significance because of its potential to help both clinicians and patients.
We benchmark faithfulness metrics against fine-grained human annotations for model-generated summaries of a patient's Brief Hospital Course.
arXiv Detail & Related papers (2023-03-07T14:57:06Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality (a sketch of this computation appears after this list).
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics such as BERTScore (a sketch of such a character-level similarity appears after this list).
arXiv Detail & Related papers (2022-04-01T14:04:16Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- An Extensive Study on Cross-Dataset Bias and Evaluation Metrics Interpretation for Machine Learning applied to Gastrointestinal Tract Abnormality Classification [2.985964157078619]
Automatic analysis of diseases in the GI tract is a hot topic in computer science and medical-related journals.
A clear understanding of evaluation metrics and machine learning models with cross datasets is crucial to bring research in the field to a new quality level.
We present comprehensive evaluations of five distinct machine learning models that can classify 16 different GI tract conditions.
arXiv Detail & Related papers (2020-05-08T08:59:31Z)
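
To make the notion of system-level correlation mentioned in the Re-Examining System-Level Correlations entry above concrete, here is a minimal sketch: it averages each system's per-summary metric scores and human judgments over a shared test set, then computes a rank correlation (Kendall's tau) across systems. The toy numbers and the use of scipy are illustrative assumptions, not taken from that paper.

```python
# Minimal sketch of a system-level correlation: average per-summary scores
# into one number per system, then correlate metric vs. human across systems.
from statistics import mean

from scipy.stats import kendalltau

# Per-summary scores for three hypothetical systems on the same test set.
metric_scores = {
    "system_a": [0.62, 0.58, 0.71],
    "system_b": [0.55, 0.49, 0.60],
    "system_c": [0.68, 0.73, 0.66],
}
human_scores = {
    "system_a": [3.5, 3.0, 4.0],
    "system_b": [2.5, 2.0, 3.0],
    "system_c": [4.0, 4.5, 3.5],
}

systems = sorted(metric_scores)
metric_means = [mean(metric_scores[s]) for s in systems]
human_means = [mean(human_scores[s]) for s in systems]

tau, p_value = kendalltau(metric_means, human_means)
print(f"system-level Kendall tau = {tau:.2f} (p = {p_value:.2f})")
```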
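
The consultation-note entry above reports that a simple character-based Levenshtein metric can rival model-based metrics such as BERTScore. The sketch below shows one assumed formulation of such a baseline, a normalized character-level Levenshtein similarity; it is illustrative and not necessarily the exact variant used in that paper.

```python
# Hypothetical sketch: normalized character-level Levenshtein similarity,
# an assumed formulation of the simple baseline discussed above.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(hypothesis: str, reference: str) -> float:
    # 1 - (edit distance / length of the longer string), in [0, 1].
    if not hypothesis and not reference:
        return 1.0
    dist = levenshtein(hypothesis, reference)
    return 1.0 - dist / max(len(hypothesis), len(reference))


print(levenshtein_similarity("Patient denies chest pain.",
                             "The patient denies any chest pain."))
```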