Comparative Experimentation of Accuracy Metrics in Automated Medical
Reporting: The Case of Otitis Consultations
- URL: http://arxiv.org/abs/2311.13273v2
- Date: Mon, 8 Jan 2024 14:19:29 GMT
- Authors: Wouter Faber, Renske Eline Bootsma, Tom Huibers, Sandra van Dulmen,
Sjaak Brinkkemper
- Abstract summary: Generative Artificial Intelligence can be used to automatically generate medical reports from transcripts of medical consultations.
The accuracy of the generated reports must be established to ensure their correctness and usefulness.
Several metrics exist for measuring the accuracy of AI-generated reports, but little work has been done on applying them to medical reporting.
- Score: 0.5242869847419834
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Generative Artificial Intelligence (AI) can be used to automatically
generate medical reports based on transcripts of medical consultations, with the
aim of reducing the administrative burden that healthcare professionals face.
The accuracy of the generated reports needs to be established to ensure their
correctness and usefulness. Several metrics exist for measuring the accuracy of
AI-generated reports, but little work has been done on applying these metrics to
medical reporting. A comparative experiment with 10 accuracy metrics was
performed on AI-generated medical reports against the corresponding General
Practitioners' (GP) medical reports for otitis consultations. The number of
missing, incorrect, and additional statements in the generated reports was
correlated with the metric scores. In addition, we introduce and define a
Composite Accuracy Score, which produces a single score for comparing the
metrics within the field of automated medical reporting. Based on the
correlation study and the Composite Accuracy Score, the findings show that
ROUGE-L and Word Mover's Distance are the preferred metrics, which is not in
line with previous work. These findings help determine the accuracy of an
AI-generated medical report, which aids the development of systems that generate
medical reports for GPs to reduce the administrative burden.
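One of the two metrics the abstract identifies as preferred, ROUGE-L, scores a generated report against a reference report via the longest common subsequence (LCS) of their token sequences. A minimal sketch (function names are illustrative; the paper's exact tokenization and implementation are not specified here):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(generated: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    gen, ref = generated.lower().split(), reference.lower().split()
    if not gen or not ref:
        return 0.0
    lcs = lcs_length(gen, ref)
    precision, recall = lcs / len(gen), lcs / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

An identical generated and reference report scores 1.0; missing or additional statements lower recall or precision respectively, which is consistent with the error counts the study correlates against.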
Related papers
- Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval [61.70489848327436]
KARE is a novel framework that integrates knowledge graph (KG) community-level retrieval with large language models (LLMs) reasoning.
Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions.
arXiv Detail & Related papers (2024-10-06T18:46:28Z) - A GEN AI Framework for Medical Note Generation [3.7444770630637167]
MediNotes is an advanced generative AI framework designed to automate the creation of SOAP (Subjective, Objective, Assessment, Plan) notes from medical conversations.
MediNotes integrates Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Automatic Speech Recognition (ASR) to capture and process both text and voice inputs in real time or from recorded audio.
arXiv Detail & Related papers (2024-09-27T23:05:02Z) - ReXamine-Global: A Framework for Uncovering Inconsistencies in Radiology Report Generation Metrics [3.028298624225796]
ReXamine-Global is a framework that tests metrics across different writing styles and patient populations.
We apply ReXamine-Global to 7 established report evaluation metrics and uncover serious gaps in their generalizability.
arXiv Detail & Related papers (2024-08-29T02:03:05Z) - RaTEScore: A Metric for Radiology Report Generation [59.37561810438641]
This paper introduces a novel entity-aware metric, Radiological Report (Text) Evaluation (RaTEScore).
RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions.
Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
arXiv Detail & Related papers (2024-06-24T17:49:28Z) - Reshaping Free-Text Radiology Notes Into Structured Reports With Generative Transformers [0.29530625605275984]
Structured reporting (SR) has been recommended by various medical societies.
We propose a pipeline to extract information from free-text reports.
Our work aims to leverage the potential of Natural Language Processing (NLP) and Transformer-based models.
arXiv Detail & Related papers (2024-03-27T18:38:39Z) - Enhancing Summarization Performance through Transformer-Based Prompt
Engineering in Automated Medical Reporting [0.49478969093606673]
A two-shot prompting approach combined with scope and domain context outperforms other methods.
The automated reports are approximately twice as long as the human references.
arXiv Detail & Related papers (2023-11-22T09:51:53Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with
Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - Human Evaluation and Correlation with Automatic Metrics in Consultation
Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BERTScore.
arXiv Detail & Related papers (2022-04-01T14:04:16Z) - Supervised Machine Learning Algorithm for Detecting Consistency between
Reported Findings and the Conclusions of Mammography Reports [66.89977257992568]
Mammography reports document the diagnosis of patients' conditions.
Many reports contain non-standard terms (non-BI-RADS descriptors) and incomplete statements.
Our aim was to develop a tool to detect such discrepancies by comparing the reported conclusions to those that would be expected based on the reported radiology findings.
arXiv Detail & Related papers (2022-02-28T08:59:04Z) - Chest X-ray Report Generation through Fine-Grained Label Learning [46.352966049776875]
We present a domain-aware automatic chest X-ray radiology report generation algorithm that learns fine-grained description of findings from images.
We also develop an automatic labeling algorithm for assigning such descriptors to images and build a novel deep learning network that recognizes both coarse and fine-grained descriptions of findings.
arXiv Detail & Related papers (2020-07-27T19:50:56Z)
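The character-based Levenshtein distance that the consultation-note evaluation study above found competitive with model-based metrics can be sketched as follows (a standard two-row dynamic-programming formulation; not the study's specific implementation):

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character edits (insertions,
    deletions, substitutions) needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1,               # delete cs from s
                curr[j - 1] + 1,           # insert ct into s
                prev[j - 1] + (cs != ct),  # substitute cs -> ct (free if equal)
            ))
        prev = curr
    return prev[len(t)]
```

For comparing reports, the raw distance is typically normalized (e.g. divided by the length of the longer string) so that scores are comparable across documents of different lengths.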