Related papers: D4AM: A General Denoising Framework for Downstream Acoustic Models

D4AM: A General Denoising Framework for Downstream Acoustic Models

URL: http://arxiv.org/abs/2311.16595v1
Date: Tue, 28 Nov 2023 08:27:27 GMT
Title: D4AM: A General Denoising Framework for Downstream Acoustic Models
Authors: Chi-Chang Lee, Yu Tsao, Hsin-Min Wang, Chu-Song Chen
Abstract summary: Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems. Existing training objectives of SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems. We propose a general denoising framework, D4AM, for various downstream acoustic models.
Score: 45.04967351760919
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The performance of acoustic models degrades notably in noisy environments. Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems. However, existing training objectives of SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems. In this study, we propose a general denoising framework, D4AM, for various downstream acoustic models. Our framework fine-tunes the SE model with the backward gradient according to a specific acoustic model and the corresponding classification objective. In addition, our method aims to consider the regression objective as an auxiliary loss to make the SE model generalize to other unseen acoustic models. To jointly train an SE unit with regression and classification objectives, D4AM uses an adjustment scheme to directly estimate suitable weighting coefficients rather than undergoing a grid search process with additional training costs. The adjustment scheme consists of two parts: gradient calibration and regression objective weighting. The experimental results show that D4AM can consistently and effectively provide improvements to various unseen acoustic models and outperforms other combination setups. Specifically, when evaluated on the Google ASR API with real noisy data completely unseen during SE training, D4AM achieves a relative WER reduction of 24.65% compared with the direct feeding of noisy input. To our knowledge, this is the first work that deploys an effective combination scheme of regression (denoising) and classification (ASR) objectives to derive a general pre-processor applicable to various unseen ASR systems. Our code is available at https://github.com/ChangLee0903/D4AM.

Related papers

Performance improvement of spatial semantic segmentation with enriched audio features and agent-based error correction for DCASE 2025 Challenge Task 4 [2.68085089595424]
This report presents submission systems for Task 4 of the DCASE 2025 Challenge.<n>It incorporates additional audio features into the embedding feature extracted from the mel-spectral feature.<n>Second, an agent-based label correction system is applied to the outputs processed by the S5 system.
arXiv Detail & Related papers (2025-06-26T12:27:52Z)
Machine Unlearning for Robust DNNs: Attribution-Guided Partitioning and Neuron Pruning in Noisy Environments [5.8166742412657895]
Deep neural networks (DNNs) have achieved remarkable success across diverse domains, but their performance can be severely degraded by noisy or corrupted training data.<n>We propose a novel framework that integrates attribution-guided data partitioning, discriminative neuron pruning, and targeted fine-tuning to mitigate the impact of noisy samples.<n>Our framework achieves approximately a 10% absolute accuracy improvement over standard retraining on CIFAR-10 with injected label noise.
arXiv Detail & Related papers (2025-06-13T09:37:11Z)
Adaptive Noise Resilient Keyword Spotting Using One-Shot Learning [5.967661928760498]
Keywords spotting (KWS) is a key component of smart devices, enabling efficient and intuitive audio interaction.<n> KWS systems often suffer performance degradation under real-world operating conditions.<n>This study proposes a low computational approach for continuous noise adaptation of pretrained neural networks used for KWS classification.
arXiv Detail & Related papers (2025-05-14T11:39:47Z)
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems. We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance. We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
Acoustic Model Optimization over Multiple Data Sources: Merging and Valuation [13.009945735929445]
We propose a novel paradigm to solve salient problems plaguing the Automatic Speech Recognition field. In the first stage, multiple acoustic models are trained based upon different subsets of the complete speech data. In the second stage, two novel algorithms are utilized to generate a high-quality acoustic model.
arXiv Detail & Related papers (2024-10-21T03:48:23Z)
Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets. We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
Efficient acoustic feature transformation in mismatched environments using a Guided-GAN [1.495380389108477]
We propose a new framework to improve automatic speech recognition systems in resource-scarce environments. We use a generative adversarial network (GAN) operating on acoustic input features to enhance the features of mismatched data. With less than one hour of data, an ASR system trained on good quality data, and evaluated on mismatched audio is improved by between 11.5% and 19.7% relative word error rate (WER)
arXiv Detail & Related papers (2022-10-03T05:33:28Z)
Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments. We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition. We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
Feature Replacement and Combination for Hybrid ASR Systems [47.74348197215634]
We investigate the usefulness of one of these front-end frameworks, namely wav2vec, for hybrid ASR systems. In addition to deploying a pre-trained feature extractor, we explore how to make use of an existing acoustic model (AM) trained on the same task with different features. We obtain a relative improvement of 4% and 6% over our previous best model on the LibriSpeech test-clean and test-other sets.
arXiv Detail & Related papers (2021-04-09T11:04:58Z)
Unsupervised Domain Adaptation for Acoustic Scene Classification Using Band-Wise Statistics Matching [69.24460241328521]
Machine learning algorithms can be negatively affected by mismatches between training (source) and test (target) data distributions. We propose an unsupervised domain adaptation method that consists of aligning the first- and second-order sample statistics of each frequency band of target-domain acoustic scenes to the ones of the source-domain training dataset. We show that the proposed method outperforms the state-of-the-art unsupervised methods found in the literature in terms of both source- and target-domain classification accuracy.
arXiv Detail & Related papers (2020-04-30T23:56:05Z)
Statistical Context-Dependent Units Boundary Correction for Corpus-based Unit-Selection Text-to-Speech [1.4337588659482519]
We present an innovative technique for speaker adaptation in order to improve the accuracy of segmentation with application to unit-selection Text-To-Speech (TTS) systems. Unlike conventional techniques for speaker adaptation, we aim to use only context dependent characteristics extrapolated with linguistic analysis techniques.
arXiv Detail & Related papers (2020-03-05T12:42:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.