Related papers: Benchmarking von ASR-Modellen im deutschen medizinischen Kontext: Eine Leistungsanalyse anhand von Anamnesegesprächen

Benchmarking von ASR-Modellen im deutschen medizinischen Kontext: Eine Leistungsanalyse anhand von Anamnesegesprächen

URL: http://arxiv.org/abs/2601.19945v1
Date: Fri, 23 Jan 2026 22:32:40 GMT
Title: Benchmarking von ASR-Modellen im deutschen medizinischen Kontext: Eine Leistungsanalyse anhand von Anamnesegesprächen
Authors: Thomas Schuster, Julius Trögele, Nico Döring, Robin Krüger, Matthieu Hoffmann, Holger Friedrich,
Abstract summary: We present a curated dataset of simulated doctor-patient conversations and evaluate a total of 29 different ASR models.<n>For evaluation, we utilize three different metrics (WER, CER, BLEU) and provide an outlook on qualitative semantic analysis.
Score: 0.0021757536468331165
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Automatic Speech Recognition (ASR) offers significant potential to reduce the workload of medical personnel, for example, through the automation of documentation tasks. While numerous benchmarks exist for the English language, specific evaluations for the German-speaking medical context are still lacking, particularly regarding the inclusion of dialects. In this article, we present a curated dataset of simulated doctor-patient conversations and evaluate a total of 29 different ASR models. The test field encompasses both open-weights models from the Whisper, Voxtral, and Wav2Vec2 families as well as commercial state-of-the-art APIs (AssemblyAI, Deepgram). For evaluation, we utilize three different metrics (WER, CER, BLEU) and provide an outlook on qualitative semantic analysis. The results demonstrate significant performance differences between the models: while the best systems already achieve very good Word Error Rates (WER) of partly below 3%, the error rates of other models, especially concerning medical terminology or dialect-influenced variations, are considerably higher.

Related papers

When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation [18.338933046286257]
Large language models (LLMs) are increasingly employed to address diverse problems, including medical queries.<n>LLMs often perform poorly in medical contexts, potentially leading to harmful misguidance for users.<n>This paper focuses on fine-tuning the Llama 2 7B, a transformer-based, decoder-only model, using transcripts from real patient-doctor interactions.
arXiv Detail & Related papers (2026-02-27T21:09:43Z)
TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models [54.48710348910535]
Existing medical reasoning benchmarks primarily focus on analyzing a patient's condition based on an image from a single visit.<n>We introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between different clinical visits.
arXiv Detail & Related papers (2025-09-29T17:51:26Z)
MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering [11.575146661047368]
We introduce MORQA, a new multilingual benchmark designed to assess the effectiveness of NLG evaluation metrics.<n>We benchmark both traditional metrics and large language model (LLM)-based evaluators, such as GPT-4 and Gemini.<n>Our results provide the first comprehensive, multilingual qualitative study of NLG evaluation in the medical domain.
arXiv Detail & Related papers (2025-09-15T19:51:57Z)
From Fuzzy Speech to Medical Insight: Benchmarking LLMs on Noisy Patient Narratives [40.12543056558646]
We present a novel dataset designed to simulate patient self-descriptions characterized by varying levels of linguistic noise, fuzzy language, and layperson terminology.<n>Our dataset comprises clinically consistent scenarios annotated with ground-truth diagnoses, spanning a spectrum of communication clarity to reflect diverse real-world reporting styles.<n>To support and future research, we release the Noisy Diagnostic Benchmark (NDB), a structured dataset of noisy, synthetic patient descriptions designed to stress-test and compare the diagnostic capabilities of large language models (LLMs) under realistic linguistic conditions.
arXiv Detail & Related papers (2025-09-15T11:34:46Z)
AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
multimodal audio-language models (ALMs) take interleaved audio and text as input and output text.<n>AHELM is a benchmark that aggregates various datasets -- including 2 new synthetic audio-text datasets called PARADE and CoRe-Bench.<n>We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z)
MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian [50.767415194856135]
We introduce MedQARo, the first large-scale medical QA benchmark in Romanian.<n>We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients.
arXiv Detail & Related papers (2025-08-22T13:48:37Z)
Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults [2.01562032767537]
This study evaluates state-of-the-art Automatic Speech Recognition (ASR) models on language use of older Dutch adults.<n>We benchmark generic multilingual ASR models, and models fine-tuned for Dutch spoken by older adults.<n>Our results show that generic multilingual models outperform fine-tuned models, which suggests recent ASR models can generalise well out of the box to real-world datasets.
arXiv Detail & Related papers (2025-08-12T07:17:44Z)
3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark [2.3011663397108078]
3MDBench is an open-source framework for simulating and evaluating LVLM-driven telemedical consultations.<n> multimodal dialogue with internal reasoning improves F1 score by 6.5% over non-dialogue settings.<n> injecting predictions from a diagnostic convolutional neural network into the LVLM's context boosts F1 by up to 20%.
arXiv Detail & Related papers (2025-03-26T07:32:05Z)
Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation [0.0]
Large language models (LLMs) have shown impressive capabilities in natural language processing tasks, including dialogue generation.<n>This research aims to conduct a novel comparative analysis of two prominent techniques, fine-tuning with LoRA and the Retrieval-Augmented Generation framework.
arXiv Detail & Related papers (2025-02-04T11:50:40Z)
Performant ASR Models for Medical Entities in Accented Speech [0.9346027495459037]
We rigorously evaluate multiple ASR models on a clinical English dataset of 93 African accents. Our analysis reveals that despite some models achieving low overall word error rates (WER), errors in clinical entities are higher, potentially posing substantial risks to patient safety.
arXiv Detail & Related papers (2024-06-18T08:19:48Z)
RuMedBench: A Russian Medical Language Understanding Benchmark [58.99199480170909]
The paper describes the open Russian medical language understanding benchmark covering several task types. We prepare the unified format labeling, data split, and evaluation metrics for new tasks. A single-number metric expresses a model's ability to cope with the benchmark.
arXiv Detail & Related papers (2022-01-17T16:23:33Z)
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark. It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification. We report empirical results with the current 11 pre-trained Chinese models, and experimental results show that state-of-the-art neural models perform by far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task. The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them. By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.