Clinical BERTScore: An Improved Measure of Automatic Speech Recognition
Performance in Clinical Settings
- URL: http://arxiv.org/abs/2303.05737v4
- Date: Fri, 28 Apr 2023 21:40:15 GMT
- Title: Clinical BERTScore: An Improved Measure of Automatic Speech Recognition
Performance in Clinical Settings
- Authors: Joel Shor, Ruyue Agnes Bi, Subhashini Venugopalan, Steven Ibara, Roman
Goldenberg, Ehud Rivlin
- Abstract summary: We present the Clinical BERTScore (CBERTScore), an ASR metric that penalizes clinically-relevant mistakes more than others.
We collect a benchmark of 18 clinician preferences on 149 realistic medical sentences called the Clinician Transcript Preference benchmark (CTP)
- Score: 7.906934286499333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Speech Recognition (ASR) in medical contexts has the potential to
save time, cut costs, increase report accuracy, and reduce physician burnout.
However, the healthcare industry has been slower to adopt this technology, in
part due to the importance of avoiding medically-relevant transcription
mistakes. In this work, we present the Clinical BERTScore (CBERTScore), an ASR
metric that penalizes clinically-relevant mistakes more than others. We
demonstrate that this metric more closely aligns with clinician preferences on
medical sentences as compared to other metrics (WER, BLUE, METEOR, etc),
sometimes by wide margins. We collect a benchmark of 18 clinician preferences
on 149 realistic medical sentences called the Clinician Transcript Preference
benchmark (CTP) and make it publicly available for the community to further
develop clinically-aware ASR metrics. To our knowledge, this is the first
public dataset of its kind. We demonstrate that CBERTScore more closely matches
what clinicians prefer.
Related papers
- Improving Clinical Documentation with AI: A Comparative Study of Sporo AI Scribe and GPT-4o mini [0.0]
Sporo Health's AI scribe was evaluated against OpenAI's GPT-4o Mini.
Results show that Sporo AI consistently outperformed GPT-4o Mini, achieving higher recall, precision, and overall F1 scores.
arXiv Detail & Related papers (2024-10-20T22:48:40Z) - RaTEScore: A Metric for Radiology Report Generation [59.37561810438641]
This paper introduces a novel, entity-aware metric, as Radiological Report (Text) Evaluation (RaTEScore)
RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions.
Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
arXiv Detail & Related papers (2024-06-24T17:49:28Z) - Performant ASR Models for Medical Entities in Accented Speech [0.9346027495459037]
We rigorously evaluate multiple ASR models on a clinical English dataset of 93 African accents.
Our analysis reveals that despite some models achieving low overall word error rates (WER), errors in clinical entities are higher, potentially posing substantial risks to patient safety.
arXiv Detail & Related papers (2024-06-18T08:19:48Z) - Aiming for Relevance [12.924312063047816]
We introduce novel vital sign prediction performance metrics that align with clinical contexts.
These metrics are derived from empirical utility curves obtained in a previous study through interviews with ICU clinicians.
We employ these metrics as loss functions for neural networks, resulting in models that excel in predicting clinically significant events.
arXiv Detail & Related papers (2024-03-27T15:11:07Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce textbfAI Hospital, a framework simulating dynamic medical interactions between emphDoctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - The Sound of Healthcare: Improving Medical Transcription ASR Accuracy
with Large Language Models [0.0]
Large Language Models (LLMs) can enhance the accuracy of Automatic Speech Recognition (ASR) systems in medical transcription.
Our research focuses on improvements in Word Error Rate (WER), Medical Concept WER (MC-WER) for the accurate transcription of essential medical terms, and speaker diarization accuracy.
arXiv Detail & Related papers (2024-02-12T14:01:12Z) - Cross-Lingual Knowledge Transfer for Clinical Phenotyping [55.92262310716537]
We investigate cross-lingual knowledge transfer strategies to execute this task for clinics that do not use the English language.
We evaluate these strategies for a Greek and a Spanish clinic leveraging clinical notes from different clinical domains.
Our results show that using multilingual data overall improves clinical phenotyping models and can compensate for data sparseness.
arXiv Detail & Related papers (2022-08-03T08:33:21Z) - Clinical Dialogue Transcription Error Correction using Seq2Seq Models [1.663938381339885]
We present a seq2seq learning approach for ASR transcription error correction of clinical dialogues.
We fine-tune a seq2seq model on a mask-filling task using a domain-specific dataset which we have shared publicly for future research.
arXiv Detail & Related papers (2022-05-26T18:27:17Z) - Human Evaluation and Correlation with Automatic Metrics in Consultation
Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par if not better than common model-based metrics like BertScore.
arXiv Detail & Related papers (2022-04-01T14:04:16Z) - Towards more patient friendly clinical notes through language models and
ontologies [57.51898902864543]
We present a novel approach to automated medical text based on word simplification and language modelling.
We use a new dataset pairs of publicly available medical sentences and a version of them simplified by clinicians.
Our method based on a language model trained on medical forum data generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z) - Benchmarking Automated Clinical Language Simplification: Dataset,
Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.