Hallucination Benchmark for Speech Foundation Models
- URL: http://arxiv.org/abs/2510.16567v1
- Date: Sat, 18 Oct 2025 16:26:16 GMT
- Title: Hallucination Benchmark for Speech Foundation Models
- Authors: Alkis Koudounas, Moreno La Quatra, Manuel Giollo, Sabato Marco Siniscalchi, Elena Baralis
- Abstract summary: Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal). This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law. We introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic.
- Score: 33.92968426403491
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal). While similar to conventional decoding errors in potentially compromising the usability of transcriptions for downstream applications, hallucinations can be more detrimental because they preserve syntactically and semantically plausible structure. This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law. Conventional evaluation is primarily centered on error-based metrics and fails to distinguish between phonetic inaccuracies and hallucinations. Consequently, there is a critical need for new evaluation frameworks that can effectively identify and assess models with a heightened propensity for generating hallucinated content. To this end, we introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic. We define targeted metrics within each category to produce interpretable profiles of model behavior. Through evaluation across various architectures and speech domains, we find that SHALLOW metrics correlate strongly with word error rate (WER) when recognition quality is high (i.e., low WER), but this correlation weakens substantially as WER increases. SHALLOW therefore captures fine-grained error patterns that WER fails to distinguish under degraded and challenging conditions. Our framework supports specific diagnosis of model weaknesses and provides feedback for model improvement beyond what aggregate error rates can offer.
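To make the multi-axis idea concrete, below is a minimal, hypothetical sketch of what such an error profile could look like. The helper functions (`edit_distance`, `profile`) and the axis proxies are illustrative stand-ins, not SHALLOW's actual metric definitions.

```python
# Minimal illustrative sketch of a multi-axis error profile in the spirit of
# SHALLOW. The axis proxies below are simplified stand-ins, not the paper's
# actual metric definitions.

def edit_distance(a, b):
    # Levenshtein distance over two token sequences, single-row DP.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            cur = min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
            prev, dp[j] = dp[j], cur
    return dp[-1]

def profile(ref, hyp):
    r, h = ref.split(), hyp.split()
    return {
        # Lexical axis: classic word error rate.
        "wer": edit_distance(r, h) / max(len(r), 1),
        # Phonetic axis (crude proxy): character error rate; a real metric
        # would compare phoneme sequences from a G2P front end.
        "char_err": edit_distance(list(ref), list(hyp)) / max(len(ref), 1),
        # Semantic axis (crude proxy): token-set overlap; a real metric
        # would use sentence embeddings.
        "lex_overlap": len(set(r) & set(h)) / max(len(set(r) | set(h)), 1),
    }

# A fluent but unrelated hypothesis degrades every axis at once, which is
# the signature a hallucination benchmark is designed to surface.
print(profile("the patient reported severe chest pain",
              "the weather in rome was lovely today"))
```

A phonetically close miss keeps the character-level error low even when WER is moderate, while a hallucination drives WER up and overlap to near zero; a single aggregate WER number conflates the two cases.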
Related papers
- Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering [94.37535002230504]
We develop a training-free, inference-time control framework termed Semantically Decoupled Latent Steering. Our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition. We show that our approach significantly reduces the probability of historical hallucinations.
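For readers unfamiliar with the general mechanism, here is a hypothetical numpy sketch of inference-time latent steering, i.e., shifting hidden states along an intervention vector; the paper's semantically decoupled construction of that vector is not reproduced here, and `steer` is an assumed helper name.

```python
# Generic latent-steering sketch: project out (alpha times) the component of
# each hidden state along a normalized intervention direction at inference.
import numpy as np

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float = 1.0):
    v = direction / np.linalg.norm(direction)
    return hidden - alpha * (hidden @ v)[..., None] * v

h = np.random.default_rng(1).normal(size=(4, 16))  # toy hidden states
v = np.random.default_rng(2).normal(size=16)       # toy intervention vector
# After steering with alpha=1, the states carry no component along v.
print(np.allclose(steer(h, v) @ (v / np.linalg.norm(v)), 0))  # True
```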
arXiv Detail & Related papers (2026-02-27T04:49:01Z)
- Semantic Visual Anomaly Detection and Reasoning in AI-Generated Images [96.43608872116347]
AnomReason is a large-scale benchmark with structured quadruple annotations, accompanied by AnomAgent. AnomReason and AnomAgent serve as a foundation for measuring and improving the semantic plausibility of AI-generated images.
arXiv Detail & Related papers (2025-10-11T14:09:24Z)
- SHALE: A Scalable Benchmark for Fine-grained Hallucination Evaluation in LVLMs [52.03164192840023]
Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge. We propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. We construct SHALE, a benchmark designed to assess both faithfulness and factuality hallucinations.
arXiv Detail & Related papers (2025-08-13T07:58:01Z)
- Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder [8.239710313549466]
This study integrates pause features with semantic coherence metrics across three datasets. Key findings demonstrate that pause features alone robustly predict the severity of formal thought disorder (FTD). These findings suggest that frameworks combining temporal and semantic analyses provide a roadmap for refining the assessment of disorganized speech.
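As a toy illustration of the temporal side only, the sketch below derives simple pause statistics from word-level timestamps. The feature names and the 0.25 s pause threshold are assumptions, and the study's semantic-coherence features are omitted.

```python
# Hypothetical pause-feature extraction from word-level timestamps.

def pause_features(words: list[tuple[str, float, float]]) -> dict:
    # words: (token, start_sec, end_sec), sorted by time.
    gaps = [s2 - e1 for (_, _, e1), (_, s2, _) in zip(words, words[1:])]
    pauses = [g for g in gaps if g > 0.25]  # assumed silence threshold
    return {"pause_count": len(pauses),
            "mean_pause": sum(pauses) / len(pauses) if pauses else 0.0,
            "pause_rate": len(pauses) / max(len(words), 1)}

words = [("i", 0.0, 0.2), ("think", 0.3, 0.6), ("that", 1.4, 1.6),
         ("maybe", 2.9, 3.3)]
print(pause_features(words))  # two long inter-word pauses in four words
```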
arXiv Detail & Related papers (2025-07-17T22:00:16Z)
- Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models [36.327525062842724]
Hallucination is especially concerning in high-stakes domains such as healthcare, legal, and aviation. We examine how factors such as distribution shifts, model size, and model architecture influence hallucination error rate (HER), a metric we introduce to quantify hallucinations. Our findings highlight the importance of incorporating HER alongside traditional metrics like WER to better assess ASR model performance.
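For intuition, a hypothetical sketch of an HER-style metric follows; the paper's exact definition may differ, and `similarity` is a crude stand-in for a semantic model.

```python
# Illustrative HER-style metric: fraction of utterances whose hypothesis is
# fluent (non-empty) yet nearly unrelated to the reference.

def similarity(ref: str, hyp: str) -> float:
    # Stand-in for a semantic similarity model (e.g., sentence embeddings):
    # token-level Jaccard overlap, purely for illustration.
    r, h = set(ref.lower().split()), set(hyp.lower().split())
    return len(r & h) / max(len(r | h), 1)

def her(pairs, tau=0.2):
    flags = [hyp.split() and similarity(ref, hyp) < tau for ref, hyp in pairs]
    return sum(bool(f) for f in flags) / max(len(pairs), 1)

pairs = [("turn left at the next junction", "turn left at the junction"),
         ("turn left at the next junction", "the stock market rallied today")]
print(her(pairs))  # 0.5: the second hypothesis is flagged as a hallucination
```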
arXiv Detail & Related papers (2025-02-18T01:25:39Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline [12.197453599489963]
We propose the development of an Error Explainable Benchmark (EEB) dataset.
This dataset, while considering both speech- and text-level, enables a granular understanding of the model's shortcomings.
Our proposition provides a structured pathway for a more 'real-world-centric' evaluation, allowing for the detection and rectification of nuanced system weaknesses.
arXiv Detail & Related papers (2024-01-26T03:42:45Z)
- Useful Blunders: Can Automated Speech Recognition Errors Improve Downstream Dementia Classification? [9.275790963007173]
We investigated how errors from automatic speech recognition (ASR) systems affect dementia classification accuracy.
We aimed to assess whether imperfect ASR-generated transcripts could provide valuable information.
arXiv Detail & Related papers (2024-01-10T21:38:03Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can leverage their generative capability to correct even tokens that are missing from the N-best list.
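To illustrate the setup, here is a hypothetical sketch of turning an N-best list into a correction prompt; the dataset's actual prompt format is not reproduced, and `build_prompt` is an assumed helper.

```python
# Hypothetical N-best-to-prompt construction for LLM-based ASR error
# correction, in the spirit of the HyPoradise setting.

def build_prompt(nbest: list[str]) -> str:
    lines = [f"{i}. {hyp}" for i, hyp in enumerate(nbest, 1)]
    return ("Below are N-best ASR hypotheses for one utterance. "
            "Produce the most likely true transcription, fixing errors "
            "and restoring words missing from every hypothesis.\n"
            + "\n".join(lines) + "\nCorrected transcription:")

nbest = ["i scream you scream we all scream for ice cream",
         "ice cream you scream we all scream for ice cream",
         "i scream you scream we all scream for i scream"]
print(build_prompt(nbest))  # feed this to any instruction-tuned LLM
```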
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- The Far Side of Failure: Investigating the Impact of Speech Recognition Errors on Subsequent Dementia Classification [8.032686410648274]
Linguistic anomalies detectable in spontaneous speech have shown promise for various clinical applications including screening for dementia and other forms of cognitive impairment.
The impressive performance that self-supervised learning (SSL) automatic speech recognition (ASR) models achieve on curated speech data does not carry over to challenging speech samples from clinical settings.
One of our key findings is that, paradoxically, ASR systems with relatively high error rates can produce transcripts that result in better downstream classification accuracy than classification based on verbatim transcripts.
arXiv Detail & Related papers (2022-11-11T17:06:45Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
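A minimal sketch of the discretization step alone follows, assuming per-frame encoder features are available; the cluster count and feature dimension are arbitrary, and the paper's actual tokenizer and synthesizer are not reproduced.

```python
# Quantize per-frame speech features into discrete symbols with k-means,
# the core of a discretize-then-resynthesize pipeline (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 80))  # stand-in for encoder frame features

codebook = KMeans(n_clusters=64, n_init=10, random_state=0).fit(features)
symbols = codebook.predict(features)    # one discrete symbol per frame
print(symbols[:20])
# A separation model would predict such symbol sequences per speaker, and a
# synthesizer would reconstruct each target waveform from its sequence.
```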
arXiv Detail & Related papers (2021-12-17T08:35:40Z)