D4AM: A General Denoising Framework for Downstream Acoustic Models
- URL: http://arxiv.org/abs/2311.16595v1
- Date: Tue, 28 Nov 2023 08:27:27 GMT
- Title: D4AM: A General Denoising Framework for Downstream Acoustic Models
- Authors: Chi-Chang Lee, Yu Tsao, Hsin-Min Wang, Chu-Song Chen
- Abstract summary: Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems.
Existing training objectives of SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems.
We propose a general denoising framework, D4AM, for various downstream acoustic models.
- Score: 45.04967351760919
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance of acoustic models degrades notably in noisy environments.
Speech enhancement (SE) can be used as a front-end strategy to aid automatic
speech recognition (ASR) systems. However, existing training objectives of SE
methods are not fully effective at integrating speech-text and noisy-clean
paired data for training toward unseen ASR systems. In this study, we propose a
general denoising framework, D4AM, for various downstream acoustic models. Our
framework fine-tunes the SE model with the backward gradient according to a
specific acoustic model and the corresponding classification objective. In
addition, our method aims to consider the regression objective as an auxiliary
loss to make the SE model generalize to other unseen acoustic models. To
jointly train an SE unit with regression and classification objectives, D4AM
uses an adjustment scheme to directly estimate suitable weighting coefficients
rather than undergoing a grid search process with additional training costs.
The adjustment scheme consists of two parts: gradient calibration and
regression objective weighting. The experimental results show that D4AM can
consistently and effectively provide improvements to various unseen acoustic
models and outperforms other combination setups. Specifically, when evaluated
on the Google ASR API with real noisy data completely unseen during SE
training, D4AM achieves a relative WER reduction of 24.65% compared with the
direct feeding of noisy input. To our knowledge, this is the first work that
deploys an effective combination scheme of regression (denoising) and
classification (ASR) objectives to derive a general pre-processor applicable to
various unseen ASR systems. Our code is available at
https://github.com/ChangLee0903/D4AM.
Related papers
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z) - Bring the Noise: Introducing Noise Robustness to Pretrained Automatic
Speech Recognition [13.53738829631595]
We propose a novel method to extract the denoising capabilities, that can be applied to any encoder-decoder architecture.
We train our pre-processor on the Noisy Speech Database (NSD) to reconstruct denoised spectrograms from noisy inputs.
We show that the Cleancoder is able to filter noise from speech and that it improves the total Word Error Rate (WER) of the downstream model in noisy conditions.
arXiv Detail & Related papers (2023-09-05T11:34:21Z) - Convoifilter: A case study of doing cocktail party speech recognition [59.80042864360884]
The model can decrease ASR's word error rate (WER) from 80% to 26.4% through this approach.
We openly share our pre-trained model to foster further research hf.co/nguyenvulebinh/voice-filter.
arXiv Detail & Related papers (2023-08-22T12:09:30Z) - Efficient acoustic feature transformation in mismatched environments
using a Guided-GAN [1.495380389108477]
We propose a new framework to improve automatic speech recognition systems in resource-scarce environments.
We use a generative adversarial network (GAN) operating on acoustic input features to enhance the features of mismatched data.
With less than one hour of data, an ASR system trained on good quality data, and evaluated on mismatched audio is improved by between 11.5% and 19.7% relative word error rate (WER)
arXiv Detail & Related papers (2022-10-03T05:33:28Z) - Improving Noise Robustness of Contrastive Speech Representation Learning
with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z) - Feature Replacement and Combination for Hybrid ASR Systems [47.74348197215634]
We investigate the usefulness of one of these front-end frameworks, namely wav2vec, for hybrid ASR systems.
In addition to deploying a pre-trained feature extractor, we explore how to make use of an existing acoustic model (AM) trained on the same task with different features.
We obtain a relative improvement of 4% and 6% over our previous best model on the LibriSpeech test-clean and test-other sets.
arXiv Detail & Related papers (2021-04-09T11:04:58Z) - Unsupervised Domain Adaptation for Acoustic Scene Classification Using
Band-Wise Statistics Matching [69.24460241328521]
Machine learning algorithms can be negatively affected by mismatches between training (source) and test (target) data distributions.
We propose an unsupervised domain adaptation method that consists of aligning the first- and second-order sample statistics of each frequency band of target-domain acoustic scenes to the ones of the source-domain training dataset.
We show that the proposed method outperforms the state-of-the-art unsupervised methods found in the literature in terms of both source- and target-domain classification accuracy.
arXiv Detail & Related papers (2020-04-30T23:56:05Z) - Statistical Context-Dependent Units Boundary Correction for Corpus-based
Unit-Selection Text-to-Speech [1.4337588659482519]
We present an innovative technique for speaker adaptation in order to improve the accuracy of segmentation with application to unit-selection Text-To-Speech (TTS) systems.
Unlike conventional techniques for speaker adaptation, we aim to use only context dependent characteristics extrapolated with linguistic analysis techniques.
arXiv Detail & Related papers (2020-03-05T12:42:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.