Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation
- URL: http://arxiv.org/abs/2204.00447v1
- Date: Fri, 1 Apr 2022 14:04:16 GMT
- Title: Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation
- Authors: Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Damir Juric, Jack Flann, Ehud Reiter, Anya Belz, Aleksandar Savkov
- Abstract summary: In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BERTScore.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, machine learning models have rapidly become better at
generating clinical consultation notes; yet, there is little work on how to
properly evaluate the generated consultation notes to understand the impact
they may have on both the clinician using them and the patient's clinical
safety. To address this we present an extensive human evaluation study of
consultation notes where 5 clinicians (i) listen to 57 mock consultations, (ii)
write their own notes, (iii) post-edit a number of automatically generated
notes, and (iv) extract all the errors, both quantitative and qualitative. We
then carry out a correlation study with 18 automatic quality metrics and the
human judgements. We find that a simple, character-based Levenshtein distance
metric performs on par with, if not better than, common model-based metrics like
BERTScore. All our findings and annotations are open-sourced.
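To make the headline finding concrete, here is a minimal, illustrative Python sketch (not the paper's released evaluation code): it computes a normalised character-level Levenshtein similarity between a generated note and a reference note, then correlates metric scores with human judgements using Spearman's rank correlation, a standard choice in metric-correlation studies. The example scores and judgements below are invented for illustration.

```python
from scipy.stats import spearmanr

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the standard dynamic-programming recurrence."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def levenshtein_similarity(generated: str, reference: str) -> float:
    """Normalise the distance to a [0, 1] similarity (1.0 = identical notes)."""
    denom = max(len(generated), len(reference)) or 1
    return 1.0 - levenshtein(generated, reference) / denom

# Invented example: one metric score and one human quality judgement per note
# (the study itself covers 57 consultations and 18 automatic metrics).
metric_scores = [0.91, 0.62, 0.78, 0.55, 0.84]
human_judgements = [4.5, 2.0, 3.5, 2.5, 4.0]

rho, p_value = spearmanr(metric_scores, human_judgements)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

The normalisation choice here (dividing by the longer note's length) is one common option; the paper's exact formulation may differ.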
Related papers
- Improving Clinical Note Generation from Complex Doctor-Patient Conversation [20.2157016701399]
We present three key contributions to the field of clinical note generation using large language models (LLMs).
First, we introduce CliniKnote, a dataset consisting of 1,200 complex doctor-patient conversations paired with their full clinical notes.
Second, we propose K-SOAP, which enhances traditional SOAP (Subjective, Objective, Assessment, and Plan) notes by adding a keyword section at the top, allowing for quick identification of essential information.
Third, we develop an automatic pipeline to generate K-SOAP notes from doctor-patient conversations and benchmark various modern LLMs using various metrics.
arXiv Detail & Related papers (2024-08-26T18:39:31Z)
- Impact of Large Language Model Assistance on Patients Reading Clinical Notes: A Mixed-Methods Study [46.5728291706842]
We developed a patient-facing tool using large language models (LLMs) to make clinical notes more readable.
We piloted the tool with clinical notes donated by patients with a history of breast cancer and synthetic notes from a clinician.
arXiv Detail & Related papers (2024-01-17T23:14:52Z)
- An Investigation of Evaluation Metrics for Automated Medical Note Generation [5.094623170336122]
We study evaluation methods and metrics for the automatic generation of clinical notes from medical conversations.
To study the correlation between automatic metrics and manual judgments, we evaluate automatic notes/summaries by comparing system-extracted facts against reference facts (a minimal sketch of this kind of fact-level scoring appears after this list).
arXiv Detail & Related papers (2023-05-27T04:34:58Z)
- Revisiting Automatic Question Summarization Evaluation in the Biomedical Domain [45.78632945525459]
We conduct human evaluations of summarization quality from four different aspects of a biomedical question summarization task.
Based on human judgments, we identify noteworthy features of current automatic metrics and summarization systems.
arXiv Detail & Related papers (2023-03-18T04:28:01Z)
- Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation [58.54483567073125]
We propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists.
We observed good levels of inter-annotator agreement in a first evaluation study using the protocol.
arXiv Detail & Related papers (2022-11-17T10:54:28Z)
- User-Driven Research of Medical Note Generation Software [49.85146209418244]
We present three rounds of user studies carried out in the context of developing a medical note generation system.
We discuss the participating clinicians' impressions and views of how the system ought to be adapted to be of value to them.
We describe a three-week test run of the system in a live telehealth clinical practice.
arXiv Detail & Related papers (2022-05-05T10:18:06Z)
- Towards more patient friendly clinical notes through language models and ontologies [57.51898902864543]
We present a novel approach to automated simplification of medical text based on word-level simplification and language modelling.
We use a new dataset of pairs of publicly available medical sentences and versions of them simplified by clinicians.
Our method based on a language model trained on medical forum data generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z)
- A preliminary study on evaluating Consultation Notes with Post-Editing [67.30200768442926]
We propose a semi-automatic approach whereby physicians post-edit generated notes before submitting them.
We conduct a preliminary study on the time saving of automatically generated consultation notes with post-editing.
We time the post-editing process and find it faster than writing the note from scratch.
arXiv Detail & Related papers (2021-04-09T14:42:00Z)
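As a rough illustration of the fact-level comparison mentioned under "An Investigation of Evaluation Metrics for Automated Medical Note Generation" above, the sketch below scores a system note by precision, recall, and F1 over sets of extracted facts. The facts are given here as plain strings matched exactly; real fact extraction and matching would be far more involved, so treat this purely as an assumed, simplified stand-in rather than the authors' pipeline.

```python
def fact_prf(system_facts: set[str], reference_facts: set[str]) -> tuple[float, float, float]:
    """Precision/recall/F1 over extracted facts, using exact string match."""
    if not system_facts or not reference_facts:
        return 0.0, 0.0, 0.0
    overlap = len(system_facts & reference_facts)
    precision = overlap / len(system_facts)
    recall = overlap / len(reference_facts)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Invented facts, as might be extracted from a system note and a clinician's reference note.
system = {"patient reports headache", "onset two days ago", "no fever"}
reference = {"patient reports headache", "onset two days ago", "photophobia present"}

p, r, f1 = fact_prf(system, reference)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```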