Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down
- URL: http://arxiv.org/abs/2505.12969v1
- Date: Mon, 19 May 2025 11:04:52 GMT
- Title: Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down
- Authors: Yingzhi Wang, Anas Alhmoud, Saad Alsahly, Muhammad Alqurishi, Mirco Ravanelli
- Abstract summary: We introduce a novel method to reduce Whisper's hallucination on non-speech segments without using pre- or post-processing techniques. We benchmark the contribution of each self-attention head in the Whisper-large-v3 decoder to the hallucination problem by performing a head-wise mask. Our best fine-tuned model, namely Calm-Whisper, achieves over 80% reduction in non-speech hallucination with less than 0.1% WER degradation.
- Score: 9.098293248868503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: OpenAI's Whisper has achieved significant success in Automatic Speech Recognition. However, it has consistently been found to exhibit hallucination issues, particularly in non-speech segments, which limits its broader application in complex industrial settings. In this paper, we introduce a novel method to reduce Whisper's hallucination on non-speech segments without using any pre- or post-processing techniques. Specifically, we benchmark the contribution of each self-attention head in the Whisper-large-v3 decoder to the hallucination problem by performing a head-wise mask. Our findings reveal that only 3 of the 20 heads account for over 75% of the hallucinations on the UrbanSound dataset. We then fine-tune these three crazy heads using a collection of non-speech data. The results show that our best fine-tuned model, namely Calm-Whisper, achieves over 80% reduction in non-speech hallucination with less than 0.1% WER degradation on LibriSpeech test-clean and test-other.
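The sketch below illustrates the head-wise masking benchmark described in the abstract: zero out one decoder self-attention head at a time and measure how often Whisper still emits text on non-speech audio. It is a minimal illustration, not the authors' released code; it assumes the Hugging Face transformers Whisper implementation, and the non-speech data loading, the hallucination criterion (any non-empty transcription), and the layer/head loop are placeholders. The abstract refers to 20 heads, and whether that count is per layer or aggregated over layers is not specified here.

```python
# Minimal sketch (not the authors' code): estimate each decoder self-attention
# head's contribution to non-speech hallucination by zeroing one head at a time
# and counting how often the model still emits text on noise-only clips.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "openai/whisper-large-v3"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).eval()

def mask_decoder_head(layer_idx: int, head_idx: int):
    """Zero one self-attention head's output by editing the input to out_proj.

    The input to out_proj is the concatenation of the per-head attention
    outputs, so zeroing the slice [head_idx*head_dim : (head_idx+1)*head_dim]
    removes that head's contribution regardless of the attention backend.
    Returns the hook handle so the mask can be removed afterwards.
    """
    attn = model.model.decoder.layers[layer_idx].self_attn
    head_dim = attn.head_dim

    def pre_hook(module, args):
        (hidden,) = args
        hidden = hidden.clone()
        hidden[..., head_idx * head_dim:(head_idx + 1) * head_dim] = 0.0
        return (hidden,)

    return attn.out_proj.register_forward_pre_hook(pre_hook)

@torch.no_grad()
def hallucination_count(non_speech_waveforms, sampling_rate=16_000):
    """Count clips for which the model emits any non-empty transcription."""
    count = 0
    for wav in non_speech_waveforms:
        inputs = processor(wav, sampling_rate=sampling_rate, return_tensors="pt")
        ids = model.generate(inputs.input_features, language="en", task="transcribe")
        text = processor.batch_decode(ids, skip_special_tokens=True)[0].strip()
        count += int(len(text) > 0)  # placeholder criterion for "hallucinated"
    return count

# Head-wise ablation: re-run the non-speech benchmark with each head masked.
# `non_speech_clips` stands in for UrbanSound-style noise segments (placeholder).
# baseline = hallucination_count(non_speech_clips)
# results = {}
# for layer in range(model.config.decoder_layers):
#     for head in range(model.config.decoder_attention_heads):
#         handle = mask_decoder_head(layer, head)
#         results[(layer, head)] = baseline - hallucination_count(non_speech_clips)
#         handle.remove()
```

Per the abstract, the heads whose masking removes the largest share of hallucinations would then be fine-tuned on a collection of non-speech data, with the rest of the model kept unchanged.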
Related papers
- Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression [6.838584336878126]
Large vision-language models (LVLMs) often suffer from hallucinations, generating text misaligned with the visual context.
Existing methods aimed at reducing hallucinations through inference-time intervention incur a significant increase in latency.
We present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference.
arXiv Detail & Related papers (2025-05-22T09:00:57Z) - Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio [15.878350948461646]
We investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference.
By inducing hallucinations with various types of sounds, we show that there exists a set of hallucinations that appear frequently.
We then study hallucinations caused by the augmentation of speech with such sounds.
arXiv Detail & Related papers (2025-01-20T10:14:52Z) - Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models [51.50892380172863]
We show that most state-of-the-art MLLMs suffer from severe verb hallucination.
We propose a novel rich verb knowledge-based tuning method to mitigate verb hallucination.
arXiv Detail & Related papers (2024-12-06T10:53:47Z) - ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models [65.12177400764506]
Large language models (LLMs) exhibit hallucinations in long-form question-answering tasks across various domains and wide applications.
Current hallucination detection and mitigation datasets are limited in domains and sizes.
This paper introduces an iterative self-training framework that simultaneously and progressively scales up the hallucination annotation dataset.
arXiv Detail & Related papers (2024-07-05T17:56:38Z) - Mitigating Object Hallucination in MLLMs via Data-augmented Phrase-level Alignment [52.43197107069751]
Multimodal Large Language Models (MLLMs) often generate factually inaccurate information, referred to as hallucination.
We introduce Data-augmented Phrase-level Alignment (DPA), a novel loss which can be applied to instruction-tuned off-the-shelf MLLMs to mitigate hallucinations.
arXiv Detail & Related papers (2024-05-28T23:36:00Z) - ALOHa: A New Measure for Hallucination in Captioning Models [61.007542765171586]
The existing metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms.
We propose a modernized open-vocabulary metric, ALOHa, which leverages large language models (LLMs) to measure object hallucinations.
We show that ALOHa correctly identifies 13.6% more hallucinated objects than CHAIR on HAT, a new gold-standard subset of MS COCO Captions annotated for hallucinations.
arXiv Detail & Related papers (2024-04-03T17:59:36Z) - Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Language Models [20.025123325871835]
Large Language Models (LLMs) generate hallucinated text when confronted with false premise questions.
We propose FAITH (False premise Attention head constraIning for miTigating Hallucinations), a novel and effective method to mitigate false premise hallucinations.
arXiv Detail & Related papers (2024-02-29T12:35:45Z) - Careless Whisper: Speech-to-Text Hallucination Harms [0.5242869847419834]
We evaluate OpenAI's Whisper, a state-of-the-art automated speech recognition service.
We find that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences.
We thematically analyze the Whisper-hallucinated content, finding that 38% of hallucinations include explicit harms.
arXiv Detail & Related papers (2024-02-12T19:35:37Z) - Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generate factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z) - Using Mobile Data and Deep Models to Assess Auditory Verbal Hallucinations [3.676944894021643]
A common form of auditory hallucination is hearing voices in the absence of any speakers.
We study N=435 individuals, who experience hearing voices, to assess auditory verbal hallucination.
arXiv Detail & Related papers (2023-04-20T15:37:34Z) - End-to-end Whispered Speech Recognition with Frequency-weighted Approaches and Pseudo Whisper Pre-training [130.56878980058966]
We present several approaches for end-to-end (E2E) recognition of whispered speech.
We achieve an overall relative reduction of 19.8% in PER and 44.4% in CER on a relatively small whispered TIMIT corpus.
As long as we have a good E2E model pre-trained on normal or pseudo-whispered speech, a relatively small set of whispered speech may suffice to obtain a reasonably good E2E whispered speech recognizer.
arXiv Detail & Related papers (2020-05-05T07:08:53Z)