Comparative Experimentation of Accuracy Metrics in Automated Medical
Reporting: The Case of Otitis Consultations
- URL: http://arxiv.org/abs/2311.13273v2
- Date: Mon, 8 Jan 2024 14:19:29 GMT
- Authors: Wouter Faber, Renske Eline Bootsma, Tom Huibers, Sandra van Dulmen,
Sjaak Brinkkemper
- Score: 0.5242869847419834
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Generative Artificial Intelligence (AI) can be used to automatically generate
medical reports based on transcripts of medical consultations. The aim is to
reduce the administrative burden that healthcare professionals face. The
accuracy of the generated reports needs to be established to ensure their
correctness and usefulness. There are several metrics for measuring the
accuracy of AI generated reports, but little work has been done towards the
application of these metrics in medical reporting. A comparative
experimentation of 10 accuracy metrics has been performed on AI generated
medical reports against their corresponding General Practitioner's (GP) medical
reports concerning Otitis consultations. The number of missing, incorrect, and
additional statements of the generated reports have been correlated with the
metric scores. In addition, we introduce and define a Composite Accuracy Score
which produces a single score for comparing the metrics within the field of
automated medical reporting. Findings show that based on the correlation study
and the Composite Accuracy Score, the ROUGE-L and Word Mover's Distance metrics
are the preferred metrics, which is not in line with previous work. These
findings help determine the accuracy of an AI-generated medical report, which
aids the development of systems that generate medical reports for GPs to reduce
the administrative burden.
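The preferred ROUGE-L metric scores a candidate report against a reference via the longest common subsequence (LCS) of tokens. A minimal sketch of how such a score can be computed, assuming simple whitespace tokenisation (production implementations such as the `rouge_score` package apply additional preprocessing):

```python
# Sketch of ROUGE-L: an F-measure over the longest common subsequence
# of tokens between a generated report and a reference GP report.
def lcs_len(a, b):
    # Classic dynamic-programming LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """ROUGE-L F-score: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

A score of 1.0 indicates token-for-token agreement in order; partial overlap yields an intermediate score, which is what gets correlated with the counted missing, incorrect, and additional statements.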
Related papers
- RaTEScore: A Metric for Radiology Report Generation [59.37561810438641]
This paper introduces a novel, entity-aware metric, Radiological Report (Text) Evaluation (RaTEScore).
RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions.
Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
arXiv Detail & Related papers (2024-06-24T17:49:28Z)
- GREEN: Generative Radiology Report Evaluation and Error Notation [14.31646900556454]
GREEN is a radiology report generation metric that leverages the natural language understanding of language models to identify and explain clinically significant errors in candidate reports.
Compared to current metrics, GREEN offers: 1) a score aligned with expert preferences, 2) human interpretable explanations of clinically significant errors, enabling feedback loops with end-users, and 3) a lightweight open-source method that reaches the performance of commercial counterparts.
arXiv Detail & Related papers (2024-05-06T16:04:03Z)
- Reshaping Free-Text Radiology Notes Into Structured Reports With Generative Transformers [0.29530625605275984]
Structured reporting (SR) has been recommended by various medical societies.
We propose a pipeline to extract information from free-text reports.
Our work aims to leverage the potential of Natural Language Processing (NLP) and Transformer-based models.
arXiv Detail & Related papers (2024-03-27T18:38:39Z)
- Enhancing Summarization Performance through Transformer-Based Prompt Engineering in Automated Medical Reporting [0.49478969093606673]
A two-shot prompting approach, in combination with scope and domain context, outperforms other methods.
The automated reports are approximately twice as long as the human references.
arXiv Detail & Related papers (2023-11-22T09:51:53Z)
- MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z)
- An Investigation of Evaluation Metrics for Automated Medical Note Generation [5.094623170336122]
We study evaluation methods and metrics for the automatic generation of clinical notes from medical conversations.
To study the correlation between the automatic metrics and manual judgments, we evaluate automatic notes/summaries by comparing the system and reference facts.
arXiv Detail & Related papers (2023-05-27T04:34:58Z)
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BERTScore.
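The character-based Levenshtein distance referenced above is straightforward to compute with dynamic programming. A minimal sketch; the normalisation into a [0, 1] similarity score is an illustrative assumption, not necessarily the exact formulation used in that study:

```python
# Character-level edit distance (insertions, deletions, substitutions)
# computed row by row to keep memory at O(len(b)).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

def levenshtein_similarity(a: str, b: str) -> float:
    """Normalise the distance into a [0, 1] similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

Despite ignoring semantics entirely, such surface-level comparisons can correlate surprisingly well with clinician judgments on note quality, which is the finding reported above.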
arXiv Detail & Related papers (2022-04-01T14:04:16Z)
- Supervised Machine Learning Algorithm for Detecting Consistency between Reported Findings and the Conclusions of Mammography Reports [66.89977257992568]
Mammography reports document the diagnosis of patients' conditions.
Many reports contain non-standard terms (non-BI-RADS descriptors) and incomplete statements.
Our aim was to develop a tool to detect such discrepancies by comparing the reported conclusions to those that would be expected based on the reported radiology findings.
arXiv Detail & Related papers (2022-02-28T08:59:04Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Chest X-ray Report Generation through Fine-Grained Label Learning [46.352966049776875]
We present a domain-aware automatic chest X-ray radiology report generation algorithm that learns fine-grained description of findings from images.
We also develop an automatic labeling algorithm for assigning such descriptors to images and build a novel deep learning network that recognizes both coarse and fine-grained descriptions of findings.
arXiv Detail & Related papers (2020-07-27T19:50:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.