Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics
- URL: http://arxiv.org/abs/2603.02252v1
- Date: Fri, 27 Feb 2026 01:44:39 GMT
- Title: Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics
- Authors: Mandip Goswami
- Abstract summary: We introduce Whisper-RIR-Mega, a benchmark dataset of paired clean and reverberant speech for evaluating automatic speech recognition (ASR) robustness to room acoustics. Each sample pairs a clean LibriSpeech utterance with the same utterance convolved with a real room impulse response from the RIR-Mega corpus. We evaluate five Whisper models on 1600 test samples and report word error rate (WER) and character error rate (CER) under clean and reverberant conditions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Whisper-RIR-Mega, a benchmark dataset of paired clean and reverberant speech for evaluating automatic speech recognition (ASR) robustness to room acoustics. Each sample pairs a clean LibriSpeech utterance with the same utterance convolved with a real room impulse response from the RIR-Mega corpus, with stratified splits by reverberation time (RT60) and direct-to-reverberant ratio (DRR). We evaluate five Whisper models (tiny through large-v3) on 1600 test samples and report word error rate (WER) and character error rate (CER) under clean and reverberant conditions. Reverberation consistently degrades performance across all model sizes; the reverb penalty in WER ranges from 0.12 to 1.07 percentage points depending on the model. We release the dataset, evaluation code, and baseline results to support reproducible research on robust ASR.
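The per-utterance recipe implied by the abstract (convolve a clean LibriSpeech utterance with a real RIR, transcribe both versions with a Whisper model, then score WER/CER against the reference transcript) can be sketched as follows. This is a minimal illustration, not the released evaluation code: the file paths, the openai-whisper, jiwer, and soundfile packages, the rough level matching, and lower-casing as the only text normalization are all assumptions.

```python
# Minimal sketch of the paired clean/reverberant evaluation described above.
# Assumptions (not from the paper): file paths, openai-whisper + jiwer + soundfile,
# simple level matching, and lower-casing as the only text normalization.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve
import whisper
import jiwer

def reverberate(clean_path, rir_path, out_path):
    """Convolve a clean utterance with a measured room impulse response."""
    speech, sr = sf.read(clean_path)
    rir, sr_rir = sf.read(rir_path)
    assert sr == sr_rir, "resample the RIR to the speech sample rate first"
    wet = fftconvolve(speech, rir)[: len(speech)]                  # keep original length
    wet *= np.max(np.abs(speech)) / (np.max(np.abs(wet)) + 1e-9)   # rough level match
    sf.write(out_path, wet, sr)
    return out_path

model = whisper.load_model("tiny")                   # any of tiny ... large-v3
reference = "reference transcript goes here"         # placeholder

clean_hyp = model.transcribe("clean_utt.wav")["text"].lower()
reverb_path = reverberate("clean_utt.wav", "rir.wav", "reverb_utt.wav")
reverb_hyp = model.transcribe(reverb_path)["text"].lower()

wer_clean, wer_reverb = jiwer.wer(reference, clean_hyp), jiwer.wer(reference, reverb_hyp)
cer_clean, cer_reverb = jiwer.cer(reference, clean_hyp), jiwer.cer(reference, reverb_hyp)
print(f"reverb penalty: {100 * (wer_reverb - wer_clean):.2f} WER percentage points")
```

Aggregating this loop over the 1600 test pairs, stratified by RT60 and DRR, would yield the kind of clean-versus-reverberant WER/CER comparison the paper reports.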
Related papers
- RIR-Mega-Speech: A Reverberant Speech Corpus with Comprehensive Acoustic Metadata and Reproducible Evaluation [0.0]
We present RIR-Mega-Speech, a corpus of approximately 117.5 hours created by convolving LibriSpeech utterances with room impulse responses. Every file includes RT60, direct-to-reverberant ratio (DRR), and clarity index metadata. We measure 5.20% WER on 1,500 paired utterances, corresponding to a paired increase of 2.50 percentage points.
arXiv Detail & Related papers (2026-01-25T22:17:18Z) - Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses [71.34350093068473]
- Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses [71.34350093068473]
This paper introduces a new paradigm for the generative error correction (GER) framework in audio-visual speech recognition (AVSR). The proposed framework, DualHyp, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models. It attains up to a 57.7% error-rate gain on the LRS2 benchmark over a standard ASR baseline, whereas single-stream GER approaches achieve only a 10% gain.
arXiv Detail & Related papers (2025-10-15T08:27:16Z) - MB-RIRs: a Synthetic Room Impulse Response Dataset with Frequency-Dependent Absorption Coefficients [7.468132532398651]
We implement three features on top of traditional image-source-method (ISM) shoebox RIRs. We train a DeepFilterNet3 model on each RIR dataset and evaluate the performance on a test set of real RIRs.
arXiv Detail & Related papers (2025-07-13T19:00:26Z) - AV-RIR: Audio-Visual Room Impulse Response Estimation [49.469389715876915]
- AV-RIR: Audio-Visual Room Impulse Response Estimation [49.469389715876915]
Accurate estimation of Room Impulse Response (RIR) is important for speech processing and AR/VR applications.
We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and visual cues of its corresponding environment.
arXiv Detail & Related papers (2023-11-30T22:58:30Z) - Synthetic Wave-Geometric Impulse Responses for Improved Speech
Dereverberation [69.1351513309953]
We show that accurately simulating the low-frequency components of Room Impulse Responses (RIRs) is important to achieving good dereverberation.
We demonstrate that speech dereverberation models trained on hybrid synthetic RIRs outperform models trained on RIRs generated by prior geometric ray tracing methods.
arXiv Detail & Related papers (2022-12-10T20:15:23Z) - Towards Improved Room Impulse Response Estimation for Speech Recognition [53.04440557465013]
We propose a novel approach for blind room impulse response (RIR) estimation systems in the context of far-field automatic speech recognition (ASR).
We first draw the connection between improved RIR estimation and improved ASR performance, as a means of evaluating neural RIR estimators.
We then propose a generative adversarial network (GAN) based architecture that encodes RIR features from reverberant speech and constructs an RIR from the encoded features.
arXiv Detail & Related papers (2022-11-08T00:40:27Z) - Improving Noise Robustness of Contrastive Speech Representation Learning
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z) - FAST-RIR: Fast neural diffuse room impulse response generator [81.96114823691343]
We present a neural-network-based fast diffuse room impulse response generator (FAST-RIR) for generating room impulse responses (RIRs) for a given acoustic environment.
Our FAST-RIR is capable of generating RIRs for a given input reverberation time with an average error of 0.02s.
We show that our proposed FAST-RIR with batch size 1 is 400 times faster than a state-of-the-art diffuse acoustic simulator (DAS) on a CPU.
arXiv Detail & Related papers (2021-10-07T05:21:01Z)