Comparative Experimentation of Accuracy Metrics in Automated Medical
Reporting: The Case of Otitis Consultations
- URL: http://arxiv.org/abs/2311.13273v2
- Date: Mon, 8 Jan 2024 14:19:29 GMT
- Authors: Wouter Faber, Renske Eline Bootsma, Tom Huibers, Sandra van Dulmen,
Sjaak Brinkkemper
- Score: 0.5242869847419834
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Generative Artificial Intelligence (AI) can be used to automatically generate
medical reports based on transcripts of medical consultations. The aim is to
reduce the administrative burden that healthcare professionals face. The
accuracy of the generated reports needs to be established to ensure their
correctness and usefulness. There are several metrics for measuring the
accuracy of AI generated reports, but little work has been done towards the
application of these metrics in medical reporting. A comparative
experimentation of 10 accuracy metrics has been performed on AI generated
medical reports against their corresponding General Practitioner's (GP) medical
reports concerning Otitis consultations. The number of missing, incorrect, and
additional statements of the generated reports have been correlated with the
metric scores. In addition, we introduce and define a Composite Accuracy Score
which produces a single score for comparing the metrics within the field of
automated medical reporting. Findings show that based on the correlation study
and the Composite Accuracy Score, the ROUGE-L and Word Mover's Distance metrics
are the preferred metrics, which is not in line with previous work. These
findings help determine the accuracy of an AI-generated medical report, which
aids the development of systems that generate medical reports for GPs to reduce
the administrative burden.
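The preferred ROUGE-L metric scores a candidate report against a reference via the longest common subsequence (LCS) of tokens. A minimal sketch of how such a score can be computed, assuming simple whitespace tokenisation (production implementations such as the `rouge_score` package apply additional preprocessing):

```python
# Sketch of ROUGE-L: an F-measure over the longest common subsequence
# of tokens between a generated report and a reference GP report.
def lcs_len(a, b):
    # Classic dynamic-programming LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """ROUGE-L F-score: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

A score of 1.0 indicates token-for-token agreement in order; partial overlap yields an intermediate score, which is what gets correlated with the counted missing, incorrect, and additional statements.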
Related papers
- RaTEScore: A Metric for Radiology Report Generation [59.37561810438641]
This paper introduces a novel, entity-aware metric, Radiological Report (Text) Evaluation (RaTEScore).
RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions.
Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
arXiv Detail & Related papers (2024-06-24T17:49:28Z)
- GREEN: Generative Radiology Report Evaluation and Error Notation [14.31646900556454]
GREEN is a radiology report generation metric that leverages the natural language understanding of language models to identify and explain clinically significant errors in candidate reports.
Compared to current metrics, GREEN offers: 1) a score aligned with expert preferences, 2) human interpretable explanations of clinically significant errors, enabling feedback loops with end-users, and 3) a lightweight open-source method that reaches the performance of commercial counterparts.
arXiv Detail & Related papers (2024-05-06T16:04:03Z)
- Reshaping Free-Text Radiology Notes Into Structured Reports With Generative Transformers [0.29530625605275984]
Structured reporting (SR) has been recommended by various medical societies.
We propose a pipeline to extract information from free-text reports.
Our work aims to leverage the potential of Natural Language Processing (NLP) and Transformer-based models.
arXiv Detail & Related papers (2024-03-27T18:38:39Z)
- Enhancing Summarization Performance through Transformer-Based Prompt Engineering in Automated Medical Reporting [0.49478969093606673]
A two-shot prompting approach, in combination with scope and domain context, outperforms other methods.
The automated reports are approximately twice as long as the human references.
arXiv Detail & Related papers (2023-11-22T09:51:53Z)
- MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z)
- An Investigation of Evaluation Metrics for Automated Medical Note Generation [5.094623170336122]
We study evaluation methods and metrics for the automatic generation of clinical notes from medical conversations.
To study the correlation between the automatic metrics and manual judgments, we evaluate automatic notes/summaries by comparing the system and reference facts.
arXiv Detail & Related papers (2023-05-27T04:34:58Z)
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BERTScore.
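The character-based Levenshtein distance referenced above is straightforward to compute with dynamic programming. A minimal sketch; the normalisation into a [0, 1] similarity score is an illustrative assumption, not necessarily the exact formulation used in that study:

```python
# Character-level edit distance (insertions, deletions, substitutions)
# computed row by row to keep memory at O(len(b)).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

def levenshtein_similarity(a: str, b: str) -> float:
    """Normalise the distance into a [0, 1] similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

Despite ignoring semantics entirely, such surface-level comparisons can correlate surprisingly well with clinician judgments on note quality, which is the finding reported above.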
arXiv Detail & Related papers (2022-04-01T14:04:16Z)
- Supervised Machine Learning Algorithm for Detecting Consistency between Reported Findings and the Conclusions of Mammography Reports [66.89977257992568]
Mammography reports document the diagnosis of patients' conditions.
Many reports contain non-standard terms (non-BI-RADS descriptors) and incomplete statements.
Our aim was to develop a tool to detect such discrepancies by comparing the reported conclusions to those that would be expected based on the reported radiology findings.
arXiv Detail & Related papers (2022-02-28T08:59:04Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Chest X-ray Report Generation through Fine-Grained Label Learning [46.352966049776875]
We present a domain-aware automatic chest X-ray radiology report generation algorithm that learns fine-grained description of findings from images.
We also develop an automatic labeling algorithm for assigning such descriptors to images and build a novel deep learning network that recognizes both coarse and fine-grained descriptions of findings.
arXiv Detail & Related papers (2020-07-27T19:50:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.