D4AM: A General Denoising Framework for Downstream Acoustic Models
- URL: http://arxiv.org/abs/2311.16595v1
- Date: Tue, 28 Nov 2023 08:27:27 GMT
- Title: D4AM: A General Denoising Framework for Downstream Acoustic Models
- Authors: Chi-Chang Lee, Yu Tsao, Hsin-Min Wang, Chu-Song Chen
- Abstract summary: Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems.
Existing training objectives of SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems.
We propose a general denoising framework, D4AM, for various downstream acoustic models.
- Score: 45.04967351760919
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance of acoustic models degrades notably in noisy environments.
Speech enhancement (SE) can be used as a front-end strategy to aid automatic
speech recognition (ASR) systems. However, existing training objectives of SE
methods are not fully effective at integrating speech-text and noisy-clean
paired data for training toward unseen ASR systems. In this study, we propose a
general denoising framework, D4AM, for various downstream acoustic models. Our
framework fine-tunes the SE model with gradients backpropagated from a specific
acoustic model and its classification objective. In addition, our method treats
the regression objective as an auxiliary loss so that the SE model generalizes
to other unseen acoustic models. To jointly train an SE unit with the
regression and classification objectives, D4AM uses an adjustment scheme that
directly estimates suitable weighting coefficients rather than undergoing a
grid search, which would incur additional training costs.
The adjustment scheme consists of two parts: gradient calibration and
regression objective weighting. The experimental results show that D4AM can
consistently and effectively provide improvements to various unseen acoustic
models and outperforms other combination setups. Specifically, when evaluated
on the Google ASR API with real noisy data completely unseen during SE
training, D4AM achieves a relative WER reduction of 24.65% compared with the
direct feeding of noisy input. To our knowledge, this is the first work that
deploys an effective combination scheme of regression (denoising) and
classification (ASR) objectives to derive a general pre-processor applicable to
various unseen ASR systems. Our code is available at
https://github.com/ChangLee0903/D4AM.
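To make the combination scheme concrete, below is a minimal PyTorch-style sketch of one joint training step, assuming a hypothetical SE front-end `se_model`, a frozen downstream recognizer `acoustic_model` with a CTC-style classification loss, and an optimizer over the SE parameters only. The function name `d4am_step` and the norm-ratio rule for estimating the auxiliary weight are illustrative assumptions standing in for the paper's gradient calibration and regression objective weighting; the authors' released code at the repository above is authoritative.

```python
import torch
import torch.nn.functional as F

def d4am_step(se_model, acoustic_model, optimizer,
              noisy, clean, noisy_asr, targets, target_lens):
    """One joint step: a classification loss backpropagated through a
    specific acoustic model into the SE front-end, plus an auxiliary
    regression (denoising) loss with an estimated weight.
    Illustrative sketch only; not the released D4AM implementation."""
    optimizer.zero_grad()

    # Regression (denoising) objective on noisy-clean paired data.
    enhanced = se_model(noisy)
    loss_reg = F.l1_loss(enhanced, clean)

    # Classification (ASR) objective on speech-text paired data: the
    # acoustic model is frozen, but gradients flow through it into the
    # SE model's parameters.
    log_probs = acoustic_model(se_model(noisy_asr))   # (T, B, V) log-softmax
    input_lens = torch.full((log_probs.size(1),), log_probs.size(0),
                            dtype=torch.long)         # assume full-length outputs
    loss_cls = F.ctc_loss(log_probs, targets, input_lens, target_lens)

    # Estimate the auxiliary weight from gradient norms instead of a
    # grid search (a heuristic stand-in for the adjustment scheme).
    se_params = [p for p in se_model.parameters() if p.requires_grad]
    g_reg = torch.autograd.grad(loss_reg, se_params, retain_graph=True)
    g_cls = torch.autograd.grad(loss_cls, se_params, retain_graph=True)
    norm_reg = torch.sqrt(sum(g.pow(2).sum() for g in g_reg))
    norm_cls = torch.sqrt(sum(g.pow(2).sum() for g in g_cls))
    alpha = (norm_cls / (norm_reg + 1e-8)).detach()   # scale-matching heuristic

    (loss_cls + alpha * loss_reg).backward()
    optimizer.step()
    return loss_cls.item(), loss_reg.item(), alpha.item()
```

Estimating the weight directly from training signals, as sketched, mirrors the abstract's point that D4AM avoids a grid search over weighting coefficients; the paper's actual adjustment scheme refines this idea with gradient calibration.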
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks (auditory, visual, and audiovisual speech recognition) enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- Acoustic Model Optimization over Multiple Data Sources: Merging and Valuation [13.009945735929445]
We propose a novel two-stage paradigm for optimizing acoustic models over multiple data sources.
In the first stage, multiple acoustic models are trained based upon different subsets of the complete speech data.
In the second stage, two novel algorithms are utilized to generate a high-quality acoustic model.
arXiv Detail & Related papers (2024-10-21T03:48:23Z)
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the harmful effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Efficient acoustic feature transformation in mismatched environments using a Guided-GAN [1.495380389108477]
We propose a new framework to improve automatic speech recognition systems in resource-scarce environments.
We use a generative adversarial network (GAN) operating on acoustic input features to enhance the features of mismatched data.
With less than one hour of data, an ASR system trained on good-quality data and evaluated on mismatched audio is improved by between 11.5% and 19.7% relative word error rate (WER).
arXiv Detail & Related papers (2022-10-03T05:33:28Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Feature Replacement and Combination for Hybrid ASR Systems [47.74348197215634]
We investigate the usefulness of one of these front-end frameworks, namely wav2vec, for hybrid ASR systems.
In addition to deploying a pre-trained feature extractor, we explore how to make use of an existing acoustic model (AM) trained on the same task with different features.
We obtain a relative improvement of 4% and 6% over our previous best model on the LibriSpeech test-clean and test-other sets.
arXiv Detail & Related papers (2021-04-09T11:04:58Z)
- Unsupervised Domain Adaptation for Acoustic Scene Classification Using Band-Wise Statistics Matching [69.24460241328521]
Machine learning algorithms can be negatively affected by mismatches between training (source) and test (target) data distributions.
We propose an unsupervised domain adaptation method that consists of aligning the first- and second-order sample statistics of each frequency band of target-domain acoustic scenes to the ones of the source-domain training dataset.
We show that the proposed method outperforms state-of-the-art unsupervised methods from the literature in both source- and target-domain classification accuracy (a minimal sketch of this band-wise matching appears after this list).
arXiv Detail & Related papers (2020-04-30T23:56:05Z)
- Statistical Context-Dependent Units Boundary Correction for Corpus-based Unit-Selection Text-to-Speech [1.4337588659482519]
We present a technique for speaker adaptation that improves segmentation accuracy, with application to unit-selection Text-To-Speech (TTS) systems.
Unlike conventional speaker adaptation techniques, we use only context-dependent characteristics extrapolated with linguistic analysis techniques.
arXiv Detail & Related papers (2020-03-05T12:42:13Z)
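As promised in the band-wise statistics matching entry above: per the abstract, the adaptation reduces to per-band standardization of target-domain features followed by re-scaling with source-domain statistics. Below is a minimal NumPy sketch of that idea; the function and variable names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def match_band_statistics(target_feats, source_mean, source_std, eps=1e-8):
    """Align the first- and second-order statistics of each frequency
    band of target-domain features to source-domain statistics.

    target_feats: (frames, bands) time-frequency features of one
        target-domain recording (e.g., a log-mel spectrogram).
    source_mean, source_std: (bands,) per-band statistics computed
        once over the source-domain training set.
    """
    t_mean = target_feats.mean(axis=0)
    t_std = target_feats.std(axis=0)
    normalized = (target_feats - t_mean) / (t_std + eps)  # zero-mean, unit-std per band
    return normalized * source_std + source_mean          # match source statistics

# Usage: compute source statistics offline, then adapt each target clip.
# source = np.vstack(source_clips)                 # (total_frames, bands)
# mu, sigma = source.mean(axis=0), source.std(axis=0)
# adapted = match_band_statistics(target_clip, mu, sigma)
```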
This list is automatically generated from the titles and abstracts of the papers on this site.