Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models
- URL: http://arxiv.org/abs/2502.12414v2
- Date: Wed, 04 Jun 2025 23:04:18 GMT
- Title: Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models
- Authors: Hanin Atwany, Abdul Waheed, Rita Singh, Monojit Choudhury, Bhiksha Raj
- Abstract summary: Hallucination is especially concerning in high-stakes domains such as healthcare, legal, and aviation. We examine how factors such as distribution shifts, model size, and model architecture influence the hallucination error rate (HER), a metric we introduce to quantify hallucinations. Our findings highlight the importance of incorporating HER alongside traditional metrics like WER to better assess ASR model performance.
- Score: 36.327525062842724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech foundation models trained at a massive scale, both in terms of model and data size, result in robust systems capable of performing multiple speech tasks, including automatic speech recognition (ASR). These models transcend language and domain barriers, yet effectively measuring their performance remains a challenge. Traditional metrics like word error rate (WER) and character error rate (CER) are commonly used to evaluate ASR performance but often fail to reflect transcription quality in critical contexts, particularly when detecting fabricated outputs. This phenomenon, known as hallucination, is especially concerning in high-stakes domains such as healthcare, legal, and aviation, where errors can have severe consequences. In our work, we address this gap by investigating hallucination in ASR models. We examine how factors such as distribution shifts, model size, and model architecture influence the hallucination error rate (HER), a metric we introduce to quantify hallucinations. Our analysis of over 20 ASR models reveals three key insights: (1) High WERs can mask low hallucination rates, while low WERs may conceal dangerous hallucinations. (2) Synthetic noise, both adversarial and common perturbations like white noise, pitch shift, and time stretching, increases HER. (3) Distribution shift correlates strongly with HER ($\alpha = 0.91$). Our findings highlight the importance of incorporating HER alongside traditional metrics like WER to better assess ASR model performance, particularly in high-stakes domains.
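To make the metric concrete: the abstract defines HER only as a rate of hallucinated outputs and does not publish its decision rule here, so the following is a minimal Python sketch, assuming a transcript is flagged as hallucinated when it is non-empty yet nearly disjoint in vocabulary from the reference. The `is_hallucination` helper, its lexical-overlap proxy, and the 0.1 threshold are illustrative assumptions standing in for the semantic-relatedness and fluency checks a real detector would use.

```python
# Hedged sketch: HER computed next to WER on a toy corpus.
# The hallucination test below is an assumed lexical proxy, not the
# paper's method; real detectors score semantic similarity and fluency.

def wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

def is_hallucination(ref: str, hyp: str, threshold: float = 0.1) -> bool:
    """Assumed proxy: non-empty output with almost no lexical overlap."""
    r, h = set(ref.lower().split()), set(hyp.lower().split())
    if not h:
        return False  # an empty output is a deletion error, not a fabrication
    return len(r & h) / max(len(r | h), 1) < threshold

def her(refs: list[str], hyps: list[str]) -> float:
    """Fraction of utterances whose transcript is flagged as hallucinated."""
    flags = [is_hallucination(r, h) for r, h in zip(refs, hyps)]
    return sum(flags) / max(len(flags), 1)

refs = ["the patient takes ten milligrams daily"]
hyps = ["the weather was lovely in barcelona today"]  # fluent but unrelated
print(f"WER: {wer(refs[0], hyps[0]):.2f}  HER: {her(refs, hyps):.2f}")
```

Note that WER alone cannot distinguish a garbled-but-related transcript from fluent fabricated content; the separate hallucination flag is precisely what HER adds on top of WER.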
Related papers
- Hallucination Benchmark for Speech Foundation Models [33.92968426403491]
Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal).
This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law.
We introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic.
arXiv Detail & Related papers (2025-10-18T16:26:16Z)
- SHALE: A Scalable Benchmark for Fine-grained Hallucination Evaluation in LVLMs [52.03164192840023]
Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge.
We propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data.
We construct SHALE, a benchmark designed to assess both faithfulness and factuality hallucinations.
arXiv Detail & Related papers (2025-08-13T07:58:01Z)
- Hallucination Score: Towards Mitigating Hallucinations in Generative Image Super-Resolution [4.620784952867118]
We focus on measuring, analyzing, and mitigating "hallucinations" in generative image super-resolution.
We take advantage of a multimodal large language model (MLLM) by constructing a prompt that assesses hallucinatory visual elements and generates a "Hallucination Score" (HS).
Our HS is closely aligned with human evaluations and also provides complementary insights to prior image metrics used for super-resolution (SR) models.
arXiv Detail & Related papers (2025-07-18T21:13:50Z)
- Mitigating Object Hallucinations via Sentence-Level Early Intervention [10.642552315531404]
Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations.
We propose SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a framework that eliminates dependency on human annotations and can reduce hallucinations by over 90% compared to the original model.
arXiv Detail & Related papers (2025-07-16T17:55:43Z)
- Shadows in the Attention: Contextual Perturbation and Representation Drift in the Dynamics of Hallucination in LLMs [6.190663515080656]
We present the first systematic study linking hallucination incidence to internal-state drift induced by context injection.
Using TruthfulQA, we construct two 16-round "titration" tracks per question.
We track overt hallucination rates with a tri-perspective detector and covert dynamics via cosine, entropy, JS, and Spearman drifts of hidden states and attention maps (a minimal sketch of these drift metrics appears after this list).
arXiv Detail & Related papers (2025-05-22T16:50:58Z)
- Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling [67.14942827452161]
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations.
In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification.
arXiv Detail & Related papers (2025-04-17T17:59:22Z)
- HausaNLP at SemEval-2025 Task 3: Towards a Fine-Grained Model-Aware Hallucination Detection [1.8230982862848586]
We aim to provide a nuanced, model-aware understanding of hallucination occurrences and severity in English.
We used natural language inference and fine-tuned a ModernBERT model using a synthetic dataset of 400 samples.
Results indicate a moderately positive correlation between the model's confidence scores and the actual presence of hallucinations.
arXiv Detail & Related papers (2025-03-25T13:40:22Z)
- KSHSeek: Data-Driven Approaches to Mitigating and Detecting Knowledge-Shortcut Hallucinations in Generative Models [17.435794516702256]
Large language models (LLMs) have significantly advanced the development of natural language processing (NLP).
Model hallucinations remain a major challenge in natural language generation (NLG) tasks due to their complex causes.
This work introduces a new paradigm for mitigating specific hallucination issues in generative models, enhancing their robustness and reliability in real-world applications.
arXiv Detail & Related papers (2025-03-25T09:18:27Z)
- Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models [13.48296910438554]
We introduce Reefknot, a comprehensive benchmark targeting relation hallucinations, comprising over 20,000 real-world samples.
We provide a systematic definition of relation hallucinations, integrating perceptive and cognitive perspectives, and construct a relation-based corpus using the Visual Genome scene graph dataset.
We propose a novel confidence-based mitigation strategy, which reduces the hallucination rate by an average of 9.75% across three datasets, including Reefknot.
arXiv Detail & Related papers (2024-08-18T10:07:02Z)
- Knowledge Overshadowing Causes Amalgamated Hallucination in Large Language Models [65.32990889402927]
We coin this phenomenon "knowledge overshadowing".
We show that the hallucination rate grows with both the imbalance ratio and the length of the dominant condition description.
We propose to utilize overshadowing conditions as a signal to catch hallucination before it is produced.
arXiv Detail & Related papers (2024-07-10T20:37:42Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- Quantity Matters: Towards Assessing and Mitigating Number Hallucination in Large Vision-Language Models [57.42800112251644]
We focus on a specific type of hallucination, number hallucination, which refers to models incorrectly identifying the number of certain objects in pictures.
We devise a training approach aimed at improving consistency to reduce number hallucinations, which leads to an 8% enhancement in performance over direct finetuning methods.
arXiv Detail & Related papers (2024-03-03T02:31:11Z)
- Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models [11.492702369437785]
Hallucinations are semantically unrelated to the source utterance, yet still fluent and coherent.
We show that commonly used metrics, such as word error rates, cannot differentiate between hallucinatory and non-hallucinatory models.
We devise a framework for identifying hallucinations by analysing their semantic connection with the ground truth and their fluency.
arXiv Detail & Related papers (2024-01-03T06:56:56Z)
- Towards Mitigating Hallucination in Large Language Models via Self-Reflection [63.2543947174318]
Large language models (LLMs) have shown promise for generative and knowledge-intensive tasks including question-answering (QA) tasks.
This paper analyses the phenomenon of hallucination in medical generative QA systems using widely adopted LLMs and datasets.
arXiv Detail & Related papers (2023-10-10T03:05:44Z)
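As referenced in the "Shadows in the Attention" entry above, the following is a minimal sketch, under stated assumptions, of the four drift measures that work names: cosine, entropy, Jensen-Shannon, and Spearman drift between a baseline hidden state and a perturbed one. The softmax normalization, the toy random vectors, and the `drift_metrics` helper are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy, spearmanr

def drift_metrics(base: np.ndarray, perturbed: np.ndarray) -> dict:
    """Four drift measures between baseline and perturbed activations."""
    # Cosine drift: 1 - cosine similarity of the raw vectors.
    cos = 1.0 - np.dot(base, perturbed) / (
        np.linalg.norm(base) * np.linalg.norm(perturbed))
    # Softmax both vectors so entropy/JS comparisons are well defined
    # (an assumed normalization; attention rows would already sum to 1).
    p = np.exp(base - base.max())
    p /= p.sum()
    q = np.exp(perturbed - perturbed.max())
    q /= q.sum()
    ent = float(entropy(q) - entropy(p))    # change in Shannon entropy
    js = float(jensenshannon(p, q))         # Jensen-Shannon distance
    rho, _ = spearmanr(base, perturbed)     # rank-order agreement
    return {"cosine": float(cos), "entropy": ent,
            "js": js, "spearman_drift": 1.0 - float(rho)}

rng = np.random.default_rng(0)
hidden = rng.normal(size=128)               # stand-in for a hidden state
print(drift_metrics(hidden, hidden + 0.3 * rng.normal(size=128)))
```

In practice such measures would be computed per layer and per titration round, then set against the detector's overt hallucination labels to expose the covert dynamics that entry describes.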
This list is automatically generated from the titles and abstracts of the papers on this site.