Consultation Checklists: Standardising the Human Evaluation of Medical
Note Generation
- URL: http://arxiv.org/abs/2211.09455v1
- Date: Thu, 17 Nov 2022 10:54:28 GMT
- Title: Consultation Checklists: Standardising the Human Evaluation of Medical
Note Generation
- Authors: Aleksandar Savkov, Francesco Moramarco, Alex Papadopoulos Korfiatis,
Mark Perera, Anya Belz, Ehud Reiter
- Abstract summary: We propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists.
We observed good levels of inter-annotator agreement in a first evaluation study using the protocol.
- Score: 58.54483567073125
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating automatically generated text is generally hard due to the
inherently subjective nature of many aspects of the output quality. This
difficulty is compounded in automatic consultation note generation by differing
opinions between medical experts both about which patient statements should be
included in generated notes and about their respective importance in arriving
at a diagnosis. Previous real-world evaluations of note-generation systems saw
substantial disagreement between expert evaluators. In this paper we propose a
protocol that aims to increase objectivity by grounding evaluations in
Consultation Checklists, which are created in a preliminary step and then used
as a common point of reference during quality assessment. We observed good
levels of inter-annotator agreement in a first evaluation study using the
protocol; further, using Consultation Checklists produced in the study as
reference for automatic metrics such as ROUGE or BERTScore improves their
correlation with human judgements compared to using the original human note.
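The abstract's metric-correlation finding can be made concrete with a short sketch. The code below is a minimal illustration, not the authors' implementation: it assumes the `rouge_score` and `scipy` packages, uses hypothetical generated notes, checklist texts, and human ratings, and simply checks how well ROUGE-L scored against the checklists correlates with the human judgements.

```python
# Minimal sketch (not the paper's code): score generated notes against
# Consultation Checklist texts with ROUGE-L and check how well the metric
# tracks human quality judgements. Assumes the `rouge_score` and `scipy`
# packages; all data below is hypothetical toy data.
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

# Hypothetical generated notes, the checklists used as references,
# and per-note human quality ratings (e.g. on a 1-5 scale).
generated_notes = [
    "Patient reports a dry cough for two weeks; no fever.",
    "Discussed knee pain; advised rest and ibuprofen.",
    "Morning headaches; blood pressure to be checked.",
]
checklists = [
    "Dry cough for two weeks. No fever. No shortness of breath.",
    "Right knee pain after running. Plan: rest, ibuprofen, review in two weeks.",
    "Morning headaches for a month. Plan: measure blood pressure.",
]
human_scores = [4, 3, 5]

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
metric_scores = [
    scorer.score(checklist, note)["rougeL"].fmeasure
    for checklist, note in zip(checklists, generated_notes)
]

# The paper's finding corresponds to this correlation being higher when
# checklists, rather than the original human notes, serve as references.
rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman correlation with human judgements: {rho:.2f} (p={p_value:.2f})")
```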
Related papers
- A Comprehensive Rubric for Annotating Pathological Speech [0.0]
We introduce a comprehensive rubric based on various dimensions of speech quality, including phonetics, fluency, and prosody.
The objective is to establish standardized criteria for identifying errors within the speech of individuals with Down syndrome.
arXiv Detail & Related papers (2024-04-29T16:44:27Z)
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric, Preference Score (PS), which fits human preferences based on quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- An Investigation of Evaluation Metrics for Automated Medical Note Generation [5.094623170336122]
We study evaluation methods and metrics for the automatic generation of clinical notes from medical conversations.
To study the correlation between automatic metrics and manual judgments, we evaluate automatically generated notes/summaries by comparing system facts against reference facts.
arXiv Detail & Related papers (2023-05-27T04:34:58Z)
- Revisiting Automatic Question Summarization Evaluation in the Biomedical Domain [45.78632945525459]
We conduct human evaluations of summarization quality from four different aspects of a biomedical question summarization task.
Based on the human judgments, we identify noteworthy characteristics of current automatic metrics and summarization systems.
arXiv Detail & Related papers (2023-03-18T04:28:01Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit low inter-annotator agreement or are limited in scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BERTScore (a minimal illustration of such a metric is sketched after this list).
arXiv Detail & Related papers (2022-04-01T14:04:16Z)
- A preliminary study on evaluating Consultation Notes with Post-Editing [67.30200768442926]
We propose a semi-automatic approach whereby physicians post-edit generated notes before submitting them.
We conduct a preliminary study of the time saved when physicians post-edit automatically generated consultation notes, and find that post-editing is faster than writing the note from scratch.
arXiv Detail & Related papers (2021-04-09T14:42:00Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
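The character-based metric highlighted in the consultation-note evaluation entry above lends itself to a short, self-contained sketch. The code below is an illustrative normalised Levenshtein similarity, not the implementation evaluated in that paper, and the example strings are hypothetical.

```python
# Illustrative character-based metric: normalised Levenshtein similarity
# between a generated note and a reference note (1.0 means identical).
# This is a sketch, not the metric implementation used in the cited study.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over characters."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(hypothesis: str, reference: str) -> float:
    """Map edit distance to a 0-1 similarity score."""
    if not hypothesis and not reference:
        return 1.0
    distance = levenshtein(hypothesis, reference)
    return 1.0 - distance / max(len(hypothesis), len(reference))


# Hypothetical example: compare a generated note against a reference note.
print(levenshtein_similarity(
    "Patient reports a dry cough for two weeks.",
    "Dry cough for two weeks, no fever reported.",
))
```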