Related papers: Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation

Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation

URL: http://arxiv.org/abs/2511.14219v1
Date: Tue, 18 Nov 2025 07:52:47 GMT
Title: Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation
Authors: Kumud Tripathi, Aditya Srinivas Menon, Aman Gaurav, Raj Prakash Gohil, Pankaj Wasnik,
Abstract summary: The Whisper model is widely adopted for its strong performance across multilingual and zero-shot settings.<n>Previous works to reduce hallucinations in Whisper-style ASR systems have primarily focused on audio preprocessing or post-processing of transcriptions to filter out erroneous content.<n>We present a two-stage architecture that first enhances encoder robustness through Adaptive Layer Attention (ALA) and further suppresses hallucinations using a multi-objective knowledge distillation (KD) framework.
Score: 9.486565210140279
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Whisper model, an open-source automatic speech recognition system, is widely adopted for its strong performance across multilingual and zero-shot settings. However, it frequently suffers from hallucination errors, especially under noisy acoustic conditions. Previous works to reduce hallucinations in Whisper-style ASR systems have primarily focused on audio preprocessing or post-processing of transcriptions to filter out erroneous content. However, modifications to the Whisper model itself remain largely unexplored to mitigate hallucinations directly. To address this challenge, we present a two-stage architecture that first enhances encoder robustness through Adaptive Layer Attention (ALA) and further suppresses hallucinations using a multi-objective knowledge distillation (KD) framework. In the first stage, ALA groups encoder layers into semantically coherent blocks via inter-layer correlation analysis. A learnable multi-head attention module then fuses these block representations, enabling the model to jointly exploit low- and high-level features for more robust encoding. In the second stage, our KD framework trains the student model on noisy audio to align its semantic and attention distributions with a teacher model processing clean inputs. Our experiments on noisy speech benchmarks show notable reductions in hallucinations and word error rates, while preserving performance on clean speech. Together, ALA and KD offer a principled strategy to improve Whisper's reliability under real-world noisy conditions.

Related papers

CAPTAIN: Semantic Feature Injection for Memorization Mitigation in Text-to-Image Diffusion Models [60.610268549138375]
Diffusion models can unintentionally reproduce training examples, raising privacy and copyright concerns.<n>We introduce CAPTAIN, a training-free framework that mitigates memorization by directly modifying latent features during denoising.
arXiv Detail & Related papers (2025-12-11T14:01:47Z)
Hallucination Benchmark for Speech Foundation Models [33.92968426403491]
Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal)<n>This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law.<n>We introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic.
arXiv Detail & Related papers (2025-10-18T16:26:16Z)
Mitigating Hallucinations via Inter-Layer Consistency Aggregation in Large Vision-Language Models [3.9464481148889354]
We propose a novel decoding mechanism, Decoding with Inter-layer Consistency via Layer Aggregation (DCLA)<n>Our approach constructs a dynamic semantic reference by aggregating representations from previous layers, and corrects semantically deviated layers to enforce inter-layer consistency.<n> Experiments on hallucination benchmarks such as MME and POPE demonstrate that DCLA effectively reduces hallucinations while enhancing the reliability and performance of LVLMs.
arXiv Detail & Related papers (2025-05-18T10:15:42Z)
Multi-Stage Speaker Diarization for Noisy Classrooms [1.4549461207028445]
This study investigates the effectiveness of multi-stage diarization models using Nvidia's NeMo diarization pipeline.<n>We assess the impact of denoising on diarization accuracy and compare various voice activity detection models.<n>We also explore a hybrid VAD approach that integrates Automatic Speech Recognition (ASR) word-level timestamps with frame-level VAD predictions.
arXiv Detail & Related papers (2025-05-16T05:35:06Z)
SEAL: Speaker Error Correction using Acoustic-conditioned Large Language Models [15.098665255729507]
We introduce a novel acoustic conditioning approach to provide more fine-grained information from the acoustic diarizer to the LLM.<n>Our approach significantly reduces the speaker error rates by 24-43% across Fisher, Callhome, and RT03-CTS datasets.
arXiv Detail & Related papers (2025-01-14T20:24:12Z)
High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations. Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR [35.710735895190844]
We propose a self-supervised framework named Wav2code to implement a feature-level SE with reduced distortions for noise-robust ASR. During finetuning, we propose a Transformer-based code predictor to accurately predict clean codes by modeling the global dependency of input noisy representations. Experiments on both synthetic and real noisy datasets demonstrate that Wav2code can solve the speech distortion and improve ASR performance under various noisy conditions.
arXiv Detail & Related papers (2023-04-11T04:46:12Z)
Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system. We show that these results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL) ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers. Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments. We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition. We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
Multi-task self-supervised learning for Robust Speech Recognition [75.11748484288229]
This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments. We employ an online speech distortion module, that contaminates the input signals with a variety of random disturbances. We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks.
arXiv Detail & Related papers (2020-01-25T00:24:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.