Towards Improved Room Impulse Response Estimation for Speech Recognition
- URL: http://arxiv.org/abs/2211.04473v2
- Date: Sun, 19 Mar 2023 20:23:19 GMT
- Title: Towards Improved Room Impulse Response Estimation for Speech Recognition
- Authors: Anton Ratnarajah, Ishwarya Ananthabhotla, Vamsi Krishna Ithapu, Pablo
Hoffmann, Dinesh Manocha, Paul Calamia
- Abstract summary: We propose a novel approach for blind room impulse response (RIR) estimation systems in the context of far-field automatic speech recognition (ASR).
We first draw the connection between improved RIR estimation and improved ASR performance, as a means of evaluating neural RIR estimators.
We then propose a generative adversarial network (GAN) based architecture that encodes RIR features from reverberant speech and constructs an RIR from the encoded features.
- Score: 53.04440557465013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel approach for blind room impulse response (RIR) estimation
systems in the context of a downstream application scenario, far-field
automatic speech recognition (ASR). We first draw the connection between
improved RIR estimation and improved ASR performance, as a means of evaluating
neural RIR estimators. We then propose a generative adversarial network (GAN)
based architecture that encodes RIR features from reverberant speech and
constructs an RIR from the encoded features, and uses a novel energy decay
relief loss to optimize for capturing energy-based properties of the input
reverberant speech. We show that our model outperforms the state-of-the-art
baselines on acoustic benchmarks (by 17% on the energy decay relief and 22%
on an early-reflection energy metric), as well as in an ASR evaluation task (by
6.9% in word error rate).
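The abstract names an energy decay relief (EDR) loss but does not give its form. As a rough illustration only, the sketch below computes an EDR in the usual textbook sense (per-frequency backward integration of short-time spectral energy, normalized to 0 dB at the first frame) and compares two RIRs by mean absolute difference in dB. The function names, the FFT and hop sizes, and the L1 comparison are assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def energy_decay_relief(rir, n_fft=256, hop=64):
    """Return the EDR in dB with shape (frames, n_fft // 2 + 1).

    Hypothetical helper: frame sizes and normalization are illustrative choices.
    """
    rir = np.asarray(rir, dtype=np.float64)
    # Zero-pad so the signal splits into a whole number of frames.
    n_frames = 1 + max(0, int(np.ceil((len(rir) - n_fft) / hop)))
    padded = np.zeros((n_frames - 1) * hop + n_fft)
    padded[: len(rir)] = rir
    window = np.hanning(n_fft)
    frames = np.stack(
        [padded[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Short-time spectral energy per frame and frequency bin.
    energy = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Backward integration over time: energy remaining from each frame onward.
    edr = np.cumsum(energy[::-1], axis=0)[::-1]
    # Normalize each bin by its total energy and convert to dB.
    return 10.0 * np.log10(edr / (edr[0] + 1e-12) + 1e-12)


def edr_l1_distance(rir_a, rir_b):
    """Mean absolute EDR difference in dB (an assumed stand-in for an EDR loss).

    Both RIRs must have the same length so the EDR shapes match.
    """
    return float(
        np.mean(np.abs(energy_decay_relief(rir_a) - energy_decay_relief(rir_b)))
    )
```

Because each bin is normalized by its own total energy, this distance ignores overall gain and responds only to how the energy decays over time and frequency.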
Related papers
- AV-RIR: Audio-Visual Room Impulse Response Estimation [49.469389715876915]
Accurate estimation of Room Impulse Response (RIR) is important for speech processing and AR/VR applications.
We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and visual cues of its corresponding environment.
arXiv Detail & Related papers (2023-11-30T22:58:30Z)
- Speech enhancement with frequency domain auto-regressive modeling [34.55703785405481]
Speech applications in far-field real world settings often deal with signals that are corrupted by reverberation.
We propose a unified framework of speech dereverberation for improving the speech quality and the automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2023-09-24T03:25:51Z)
- Synthetic Wave-Geometric Impulse Responses for Improved Speech Dereverberation [69.1351513309953]
We show that accurately simulating the low-frequency components of Room Impulse Responses (RIRs) is important to achieving good dereverberation.
We demonstrate that speech dereverberation models trained on hybrid synthetic RIRs outperform models trained on RIRs generated by prior geometric ray tracing methods.
arXiv Detail & Related papers (2022-12-10T20:15:23Z)
- CTA-RNN: Channel and Temporal-wise Attention RNN Leveraging Pre-trained ASR Embeddings for Speech Emotion Recognition [20.02248459288662]
We propose a novel channel and temporal-wise attention RNN architecture based on the intermediate representations of pre-trained ASR models.
We evaluate our approach on two popular benchmark datasets, IEMOCAP and MSP-IMPROV.
arXiv Detail & Related papers (2022-03-31T13:32:51Z)
- Deep Impulse Responses: Estimating and Parameterizing Filters with Deep Networks [76.830358429947]
Impulse response estimation in high noise and in-the-wild settings is a challenging problem.
We propose a novel framework for parameterizing and estimating impulse responses based on recent advances in neural representation learning.
arXiv Detail & Related papers (2022-02-07T18:57:23Z)
- FAST-RIR: Fast neural diffuse room impulse response generator [81.96114823691343]
We present a neural-network-based fast diffuse room impulse response generator (FAST-RIR) for generating room impulse responses (RIRs) for a given acoustic environment.
Our FAST-RIR is capable of generating RIRs for a given input reverberation time with an average error of 0.02s.
We show that our proposed FAST-RIR with batch size 1 is 400 times faster than a state-of-the-art diffuse acoustic simulator (DAS) on a CPU.
arXiv Detail & Related papers (2021-10-07T05:21:01Z)
- Improving RNN Transducer Based ASR with Auxiliary Tasks [21.60022481898402]
End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results.
In this work, we examine ways in which recurrent neural network transducer (RNN-T) can achieve better ASR accuracy via performing auxiliary tasks.
arXiv Detail & Related papers (2020-11-05T21:46:32Z)
- Improving noise robust automatic speech recognition with single-channel time-domain enhancement network [100.1041336974175]
We show that a single-channel time-domain denoising approach can significantly improve ASR performance.
We show that single-channel noise reduction can still improve ASR performance.
arXiv Detail & Related papers (2020-03-09T09:36:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.