Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down
- URL: http://arxiv.org/abs/2505.12969v1
- Date: Mon, 19 May 2025 11:04:52 GMT
- Title: Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down
- Authors: Yingzhi Wang, Anas Alhmoud, Saad Alsahly, Muhammad Alqurishi, Mirco Ravanelli
- Abstract summary: We introduce a novel method to reduce Whisper's hallucination on non-speech segments without using pre- or post-processing techniques. We benchmark the contribution of each self-attention head in the Whisper-large-v3 decoder to the hallucination problem by performing a head-wise mask. Our best fine-tuned model, namely Calm-Whisper, achieves over 80% reduction in non-speech hallucination with less than 0.1% WER degradation.
- Score: 9.098293248868503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: OpenAI's Whisper has achieved significant success in Automatic Speech Recognition. However, it has consistently been found to exhibit hallucination issues, particularly in non-speech segments, which limits its broader application in complex industrial settings. In this paper, we introduce a novel method to reduce Whisper's hallucination on non-speech segments without using any pre- or post-processing techniques. Specifically, we benchmark the contribution of each self-attention head in the Whisper-large-v3 decoder to the hallucination problem by performing a head-wise mask. Our findings reveal that only 3 of the 20 heads account for over 75% of the hallucinations on the UrbanSound dataset. We then fine-tune these three crazy heads using a collection of non-speech data. The results show that our best fine-tuned model, namely Calm-Whisper, achieves over 80% reduction in non-speech hallucination with less than 0.1% WER degradation on LibriSpeech test-clean and test-other.
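The sketch below illustrates the head-wise masking benchmark described in the abstract: zero out one decoder self-attention head at a time and measure how often Whisper still emits text on non-speech audio. It is a minimal illustration, not the authors' released code; it assumes the Hugging Face transformers Whisper implementation, and the non-speech data loading, the hallucination criterion (any non-empty transcription), and the layer/head loop are placeholders. The abstract refers to 20 heads, and whether that count is per layer or aggregated over layers is not specified here.

```python
# Minimal sketch (not the authors' code): estimate each decoder self-attention
# head's contribution to non-speech hallucination by zeroing one head at a time
# and counting how often the model still emits text on noise-only clips.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "openai/whisper-large-v3"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).eval()

def mask_decoder_head(layer_idx: int, head_idx: int):
    """Zero one self-attention head's output by editing the input to out_proj.

    The input to out_proj is the concatenation of the per-head attention
    outputs, so zeroing the slice [head_idx*head_dim : (head_idx+1)*head_dim]
    removes that head's contribution regardless of the attention backend.
    Returns the hook handle so the mask can be removed afterwards.
    """
    attn = model.model.decoder.layers[layer_idx].self_attn
    head_dim = attn.head_dim

    def pre_hook(module, args):
        (hidden,) = args
        hidden = hidden.clone()
        hidden[..., head_idx * head_dim:(head_idx + 1) * head_dim] = 0.0
        return (hidden,)

    return attn.out_proj.register_forward_pre_hook(pre_hook)

@torch.no_grad()
def hallucination_count(non_speech_waveforms, sampling_rate=16_000):
    """Count clips for which the model emits any non-empty transcription."""
    count = 0
    for wav in non_speech_waveforms:
        inputs = processor(wav, sampling_rate=sampling_rate, return_tensors="pt")
        ids = model.generate(inputs.input_features, language="en", task="transcribe")
        text = processor.batch_decode(ids, skip_special_tokens=True)[0].strip()
        count += int(len(text) > 0)  # placeholder criterion for "hallucinated"
    return count

# Head-wise ablation: re-run the non-speech benchmark with each head masked.
# `non_speech_clips` stands in for UrbanSound-style noise segments (placeholder).
# baseline = hallucination_count(non_speech_clips)
# results = {}
# for layer in range(model.config.decoder_layers):
#     for head in range(model.config.decoder_attention_heads):
#         handle = mask_decoder_head(layer, head)
#         results[(layer, head)] = baseline - hallucination_count(non_speech_clips)
#         handle.remove()
```

Per the abstract, the heads whose masking removes the largest share of hallucinations would then be fine-tuned on a collection of non-speech data, with the rest of the model kept unchanged.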
Related papers
- Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression [6.838584336878126]
Large vision-language models (LVLMs) often suffer from hallucinations, generating text misaligned with the visual context.
Existing methods aimed at reducing hallucinations through inference-time intervention incur a significant increase in latency.
We present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference.
arXiv Detail & Related papers (2025-05-22T09:00:57Z) - Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio [15.878350948461646]
We investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference.
By inducing hallucinations with various types of sounds, we show that there exists a set of hallucinations that appear frequently.
We then study hallucinations caused by the augmentation of speech with such sounds.
arXiv Detail & Related papers (2025-01-20T10:14:52Z) - Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models [51.50892380172863]
We show that most state-of-the-art MLLMs suffer from severe verb hallucination.
We propose a novel rich verb knowledge-based tuning method to mitigate verb hallucination.
arXiv Detail & Related papers (2024-12-06T10:53:47Z) - ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models [65.12177400764506]
Large language models (LLMs) exhibit hallucinations in long-form question-answering tasks across various domains and wide applications.
Current hallucination detection and mitigation datasets are limited in domains and sizes.
This paper introduces an iterative self-training framework that simultaneously and progressively scales up the hallucination annotation dataset.
arXiv Detail & Related papers (2024-07-05T17:56:38Z) - Mitigating Object Hallucination in MLLMs via Data-augmented Phrase-level Alignment [52.43197107069751]
Multimodal Large Language Models (MLLMs) often generate factually inaccurate information, referred to as hallucination.
We introduce Data-augmented Phrase-level Alignment (DPA), a novel loss which can be applied to instruction-tuned off-the-shelf MLLMs to mitigate hallucinations.
arXiv Detail & Related papers (2024-05-28T23:36:00Z) - ALOHa: A New Measure for Hallucination in Captioning Models [61.007542765171586]
The existing metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms.
We propose a modernized open-vocabulary metric, ALOHa, which leverages large language models (LLMs) to measure object hallucinations.
We show that ALOHa correctly identifies 13.6% more hallucinated objects than CHAIR on HAT, a new gold-standard subset of MS COCO Captions annotated for hallucinations.
arXiv Detail & Related papers (2024-04-03T17:59:36Z) - Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Language Models [20.025123325871835]
Large Language Models (LLMs) generate hallucinated text when confronted with false premise questions.
We propose FAITH (False premise Attention head constraIning for miTigating Hallucinations), a novel and effective method to mitigate false premise hallucinations.
arXiv Detail & Related papers (2024-02-29T12:35:45Z) - Careless Whisper: Speech-to-Text Hallucination Harms [0.5242869847419834]
We evaluate OpenAI's Whisper, a state-of-the-art automated speech recognition service.
We find that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences.
We thematically analyze the Whisper-hallucinated content, finding that 38% of hallucinations include explicit harms.
arXiv Detail & Related papers (2024-02-12T19:35:37Z) - Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generate factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z) - Using Mobile Data and Deep Models to Assess Auditory Verbal Hallucinations [3.676944894021643]
A common form of auditory hallucination is hearing voices in the absence of any speakers.
We study N=435 individuals, who experience hearing voices, to assess auditory verbal hallucination.
arXiv Detail & Related papers (2023-04-20T15:37:34Z) - End-to-end Whispered Speech Recognition with Frequency-weighted Approaches and Pseudo Whisper Pre-training [130.56878980058966]
We present several approaches for end-to-end (E2E) recognition of whispered speech.
We achieve an overall relative reduction of 19.8% in PER and 44.4% in CER on a relatively small whispered TIMIT corpus.
As long as we have a good E2E model pre-trained on normal or pseudo-whispered speech, a relatively small set of whispered speech may suffice to obtain a reasonably good E2E whispered speech recognizer.
arXiv Detail & Related papers (2020-05-05T07:08:53Z)