Speech enhancement with frequency domain auto-regressive modeling
- URL: http://arxiv.org/abs/2309.13537v1
- Date: Sun, 24 Sep 2023 03:25:51 GMT
- Title: Speech enhancement with frequency domain auto-regressive modeling
- Authors: Anurenjan Purushothaman, Debottam Dutta, Rohit Kumar and Sriram
Ganapathy
- Abstract summary: Speech applications in far-field real world settings often deal with signals that are corrupted by reverberation.
We propose a unified framework of speech dereverberation for improving the speech quality and the automatic speech recognition (ASR) performance.
- Score: 34.55703785405481
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech applications in far-field real world settings often deal with signals
that are corrupted by reverberation. The task of dereverberation constitutes an
important step to improve the audible quality and to reduce the error rates in
applications like automatic speech recognition (ASR). We propose a unified
framework of speech dereverberation for improving the speech quality and the
ASR performance using the approach of envelope-carrier decomposition provided
by an autoregressive (AR) model. The AR model is applied in the frequency
domain of the sub-band speech signals to separate the envelope and carrier
parts. A novel neural architecture based on a dual-path long short-term memory
(DPLSTM) model is proposed, which jointly enhances the sub-band envelope and
carrier components. The dereverberated envelope and carrier components are
re-modulated, and the sub-band signals are synthesized to reconstruct the audio
signal. The DPLSTM model for dereverberation of the envelope and carrier
components also allows joint learning of the network weights for the downstream
ASR task.
In the ASR tasks on the REVERB challenge dataset as well as on the VOiCES
dataset, we illustrate that joint learning of the speech dereverberation
network and the E2E ASR model yields significant performance improvements over
the baseline ASR system trained on log-mel spectrograms as well as over other
benchmarks for dereverberation (average relative improvements of 10-24% over
the baseline system). The speech quality improvements, evaluated using
subjective listening tests, further highlight the improved quality of the
reconstructed audio.
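To make the pipeline concrete, the sketch below illustrates the envelope-carrier decomposition step: frequency-domain linear prediction, i.e. an autoregressive fit over the DCT of a sub-band signal, gives an all-pole estimate of the temporal envelope, and the carrier is taken as the residual. This is a minimal sketch assuming a single pre-computed sub-band segment; the model order, numerical safeguards, and function names are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' code): envelope-carrier decomposition of one
# sub-band signal via frequency-domain autoregressive (AR) modeling.
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz

def fdlp_envelope(subband: np.ndarray, order: int = 80) -> np.ndarray:
    """All-pole (AR) estimate of the squared temporal envelope of `subband`,
    obtained by linear prediction over its DCT coefficients."""
    n = len(subband)
    c = dct(subband, type=2, norm="ortho")                 # frequency-domain view
    r = np.correlate(c, c, mode="full")[n - 1:n + order]   # autocorrelation, lags 0..order
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])  # Yule-Walker AR coefficients
    gain = r[0] - a @ r[1:order + 1]                       # prediction-error power
    # Evaluate the all-pole spectrum at n points; this approximates the
    # (squared) Hilbert envelope of the sub-band signal.
    ang = -1j * np.pi * np.outer(np.arange(n) / n, np.arange(1, order + 1))
    denom = np.abs(1.0 - np.exp(ang) @ a) ** 2
    return gain / np.maximum(denom, 1e-12)

def envelope_carrier(subband: np.ndarray, order: int = 80):
    """Split a sub-band signal into a slowly varying envelope and a carrier."""
    env = np.sqrt(np.maximum(fdlp_envelope(subband, order), 1e-12))
    carrier = subband / np.maximum(env, 1e-8)              # residual fine structure
    return env, carrier

# Re-modulation (env * carrier per sub-band) followed by sub-band synthesis
# reconstructs the audio; in the paper both components are first enhanced.
```

The enhancement itself is performed jointly on the envelope and carrier features by the DPLSTM network. The exact published architecture is not reproduced here; the block below is only a generic dual-path LSTM layer of the kind such a model is built from (sequence modeling along time within each sub-band, then across sub-bands at each frame), with an assumed tensor layout and layer sizes.

```python
import torch
import torch.nn as nn

class DualPathLSTMBlock(nn.Module):
    """Generic dual-path block: a bidirectional LSTM along time within each
    sub-band, then one across sub-bands per frame, each with a residual
    connection. Hidden sizes and feature layout are assumptions."""
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.intra = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.intra_proj = nn.Linear(2 * hidden, feat_dim)
        self.inter = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.inter_proj = nn.Linear(2 * hidden, feat_dim)

    def forward(self, x):                                  # x: (batch, sub-bands, frames, features)
        b, s, t, f = x.shape
        y = x.reshape(b * s, t, f)                         # intra path: along time per sub-band
        y = self.intra_proj(self.intra(y)[0]).reshape(b, s, t, f) + x
        z = y.permute(0, 2, 1, 3).reshape(b * t, s, f)     # inter path: across sub-bands per frame
        z = self.inter_proj(self.inter(z)[0]).reshape(b, t, s, f).permute(0, 2, 1, 3)
        return z + y
```

Because the whole chain (decomposition, enhancement, re-modulation, synthesis, feature extraction) is differentiable, the dereverberation weights can also be fine-tuned with gradients from the E2E ASR loss, which is the joint training referred to in the abstract.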
Related papers
- ROSE: A Recognition-Oriented Speech Enhancement Framework in Air Traffic Control Using Multi-Objective Learning [6.60571587618006]
Radio speech echo is a specific phenomenon in the air traffic control (ATC) domain, which degrades speech quality and impacts automatic speech recognition (ASR) accuracy.
In this work, a time-domain recognition-oriented speech enhancement framework is proposed to improve speech intelligibility and advance ASR accuracy.
The framework serves as a plug-and-play tool in ATC scenarios and does not require additional retraining of the ASR model.
arXiv Detail & Related papers (2023-12-11T04:51:41Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- mdctGAN: Taming transformer-based GAN for speech super-resolution with Modified DCT spectra [4.721572768262729]
Speech super-resolution (SSR) aims to recover a high resolution (HR) speech from its corresponding low resolution (LR) counterpart.
Recent SSR methods focus more on the reconstruction of the magnitude spectrogram, ignoring the importance of phase reconstruction.
We propose mdctGAN, a novel SSR framework based on the modified discrete cosine transform (MDCT).
arXiv Detail & Related papers (2023-05-18T16:49:46Z)
- Towards Improved Room Impulse Response Estimation for Speech Recognition [53.04440557465013]
We propose a novel approach for blind room impulse response (RIR) estimation systems in the context of far-field automatic speech recognition (ASR).
We first draw the connection between improved RIR estimation and improved ASR performance, as a means of evaluating neural RIR estimators.
We then propose a generative adversarial network (GAN) based architecture that encodes RIR features from reverberant speech and constructs an RIR from the encoded features.
arXiv Detail & Related papers (2022-11-08T00:40:27Z)
- Model-based Deep Learning Receiver Design for Rate-Splitting Multiple Access [65.21117658030235]
This work proposes a novel design for a practical RSMA receiver based on model-based deep learning (MBDL) methods.
The MBDL receiver is evaluated in terms of uncoded Symbol Error Rate (SER), throughput performance through Link-Level Simulations (LLS) and average training overhead.
Results reveal that the MBDL outperforms by a significant margin the SIC receiver with imperfect CSIR.
arXiv Detail & Related papers (2022-05-02T12:23:55Z)
- CMGAN: Conformer-based Metric GAN for Speech Enhancement [6.480967714783858]
We propose a conformer-based metric generative adversarial network (CMGAN) for the time-frequency domain.
In the generator, we utilize two-stage conformer blocks to aggregate all magnitude and complex spectrogram information.
The estimation of magnitude and complex spectrogram is decoupled in the decoder stage and then jointly incorporated to reconstruct the enhanced speech.
arXiv Detail & Related papers (2022-03-28T23:53:34Z)
- ASR-Aware End-to-end Neural Diarization [15.172086811068962]
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
arXiv Detail & Related papers (2022-02-02T21:17:14Z)
- Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition [71.96870151495536]
We propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR).
The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model.
We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech.
arXiv Detail & Related papers (2021-10-08T05:07:35Z)
- Improving Stability of LS-GANs for Audio and Speech Signals [70.15099665710336]
We show that encoding departure from normality computed in this vector space into the generator optimization formulation helps to craft more comprehensive spectrograms.
We demonstrate the effectiveness of binding this metric for enhancing stability in training with less mode collapse compared to baseline GANs.
arXiv Detail & Related papers (2020-08-12T17:41:25Z)
- Improving noise robust automatic speech recognition with single-channel time-domain enhancement network [100.1041336974175]
We show that a single-channel time-domain denoising approach can significantly improve ASR performance.
We show that single-channel noise reduction can still improve ASR performance.
arXiv Detail & Related papers (2020-03-09T09:36:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.