Efficient acoustic feature transformation in mismatched environments
using a Guided-GAN
- URL: http://arxiv.org/abs/2210.00721v3
- Date: Thu, 6 Oct 2022 06:33:38 GMT
- Title: Efficient acoustic feature transformation in mismatched environments
using a Guided-GAN
- Authors: Walter Heymans, Marelie H. Davel, Charl van Heerden
- Abstract summary: We propose a new framework to improve automatic speech recognition systems in resource-scarce environments.
We use a generative adversarial network (GAN) operating on acoustic input features to enhance the features of mismatched data.
With less than one hour of data, an ASR system trained on good-quality data and evaluated on mismatched audio is improved by between 11.5% and 19.7% relative word error rate (WER).
- Score: 1.495380389108477
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We propose a new framework to improve automatic speech recognition (ASR)
systems in resource-scarce environments using a generative adversarial network
(GAN) operating on acoustic input features. The GAN is used to enhance the
features of mismatched data prior to decoding, or can optionally be used to
fine-tune the acoustic model. We achieve improvements that are comparable to
multi-style training (MTR), but at a lower computational cost. With less than
one hour of data, an ASR system trained on good-quality data and evaluated on
mismatched audio is improved by between 11.5% and 19.7% relative word error
rate (WER). Experiments demonstrate that the framework can be very useful in
under-resourced environments where training data and computational resources
are limited. The GAN does not require parallel training data, because it
utilises a baseline acoustic model to provide an additional loss term that
guides the generator to create acoustic features that are better classified by
the baseline.
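To make the guidance idea concrete, here is a minimal PyTorch sketch of one training step. It is not the authors' implementation: it assumes the frozen baseline acoustic model outputs per-frame log-probabilities, and that `pseudo_labels` are frame-level targets obtained, for example, by decoding the mismatched audio with the baseline. Note that the good-quality `reference` batch does not have to be parallel to the `mismatched` batch, which is what removes the need for parallel training data.

```python
import torch
import torch.nn.functional as F

def guided_gan_step(generator, discriminator, baseline_am,
                    mismatched, reference, pseudo_labels,
                    g_opt, d_opt, lambda_guide=1.0):
    """One illustrative Guided-GAN training step (sketch, not the paper's code)."""
    for p in baseline_am.parameters():
        p.requires_grad_(False)  # the baseline acoustic model stays frozen

    # Discriminator: good-quality reference features vs. enhanced features.
    # `reference` and `mismatched` are independent batches (no parallel data).
    enhanced = generator(mismatched)
    d_real = discriminator(reference)
    d_fake = discriminator(enhanced.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: adversarial term plus guidance from the frozen baseline AM,
    # which rewards features the baseline classifies well.
    d_gen = discriminator(enhanced)
    adv = F.binary_cross_entropy_with_logits(d_gen, torch.ones_like(d_gen))
    log_probs = baseline_am(enhanced)           # assumed (batch, time, classes) log-probs
    guide = F.nll_loss(log_probs.transpose(1, 2), pseudo_labels)
    g_loss = adv + lambda_guide * guide
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```

The `lambda_guide` weight trades the adversarial term off against the baseline-guided classification term; the actual weighting used in the paper is not specified in the abstract.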
Related papers
- Acoustic Model Optimization over Multiple Data Sources: Merging and Valuation [13.009945735929445]
We propose a novel paradigm to solve salient problems plaguing the Automatic Speech Recognition field.
In the first stage, multiple acoustic models are trained based upon different subsets of the complete speech data.
In the second stage, two novel algorithms are utilized to generate a high-quality acoustic model.
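The summary does not spell out the two merging algorithms, so the following is only a generic illustration of the second stage: weighted parameter averaging of acoustic models trained on different data subsets, with hypothetical per-model `weights` standing in for a data-valuation score.

```python
import torch

def merge_acoustic_models(state_dicts, weights):
    """Weighted parameter averaging of models trained on different subsets.
    Purely illustrative; the paper's own merging and valuation algorithms
    are not described in the summary above."""
    total = sum(weights)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float()
                          for sd, w in zip(state_dicts, weights)) / total
    return merged
```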
arXiv Detail & Related papers (2024-10-21T03:48:23Z)
- D4AM: A General Denoising Framework for Downstream Acoustic Models [45.04967351760919]
Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems.
Existing training objectives of SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems.
We propose a general denoising framework, D4AM, for various downstream acoustic models.
arXiv Detail & Related papers (2023-11-28T08:27:27Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Heterogeneous Reservoir Computing Models for Persian Speech Recognition [0.0]
Reservoir computing (RC) models have been proven inexpensive to train, have vastly fewer parameters, and are compatible with emergent hardware technologies.
We propose heterogeneous single and multi-layer ESNs to create non-linear transformations of the inputs that capture temporal context at different scales.
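Since this entry leans on standard reservoir computing, a minimal leaky echo state network (ESN) sketch may help; it shows why RC models are cheap to train (only a linear readout fitted on the collected states is learned). The single-layer NumPy version below is generic; the paper's heterogeneous multi-layer ESNs vary such hyperparameters across layers.

```python
import numpy as np

def esn_states(inputs, n_reservoir=500, spectral_radius=0.9,
               input_scale=1.0, leak=0.3, seed=0):
    """Collect leaky-ESN reservoir states for an input sequence of shape
    (time, n_in). The reservoir weights are random and never trained."""
    rng = np.random.default_rng(seed)
    n_in = inputs.shape[1]
    w_in = input_scale * rng.uniform(-1, 1, (n_reservoir, n_in))
    w = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
    # Rescale for the echo state property (spectral radius < 1).
    w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
    x = np.zeros(n_reservoir)
    states = []
    for u in inputs:
        x = (1 - leak) * x + leak * np.tanh(w_in @ u + w @ x)
        states.append(x.copy())
    return np.stack(states)  # only a linear readout is trained on these
```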
arXiv Detail & Related papers (2022-05-25T09:15:15Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve performance comparable to the best reported supervised approach using only 16% of the labeled data.
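As an illustration of the general recipe (not this paper's exact framework), the sketch below combines a wav2vec-style InfoNCE contrastive term over masked frames with a reconstruction term that encourages recovering clean speech from noisy input; the tensor shapes, `temperature`, and `alpha` weighting are assumptions, and the reconstruction term presumes paired noisy/clean examples.

```python
import torch
import torch.nn.functional as F

def contrastive_plus_reconstruction(ctx, targets, negatives, recon, clean,
                                    temperature=0.1, alpha=1.0):
    """Illustrative joint objective: InfoNCE over masked frames plus an L1
    reconstruction term. ctx, targets: (batch, time, dim);
    negatives: (n_neg, batch, time, dim); recon, clean: same shape as each other."""
    pos = F.cosine_similarity(ctx, targets, dim=-1).unsqueeze(0)    # (1, B, T)
    neg = F.cosine_similarity(ctx.unsqueeze(0), negatives, dim=-1)  # (N, B, T)
    logits = torch.cat([pos, neg], dim=0) / temperature             # (1+N, B, T)
    labels = torch.zeros_like(logits[0], dtype=torch.long)          # positive = index 0
    contrastive = F.cross_entropy(logits.permute(1, 0, 2), labels)
    reconstruction = F.l1_loss(recon, clean)
    return contrastive + alpha * reconstruction
```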
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition [71.96870151495536]
We propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR).
The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model.
We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech.
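A minimal sketch of the adversarial reprogramming idea, with placeholder components rather than the paper's exact ones: a single trainable perturbation is added to every target-domain waveform, the pretrained SCR model stays frozen, and a (here hypothetical) `class_map` from source classes to target commands stands in for the similarity-based mapping.

```python
import torch
import torch.nn.functional as F

class Reprogrammer(torch.nn.Module):
    """Learns one additive perturbation `delta` applied to all input waveforms;
    only `delta` receives gradients, the pretrained model is never updated."""
    def __init__(self, wave_len):
        super().__init__()
        self.delta = torch.nn.Parameter(torch.zeros(wave_len))

    def forward(self, wave):            # wave: (batch, wave_len)
        return wave + self.delta        # reprogrammed input

def train_step(reprog, frozen_model, class_map, wave, target_labels, opt):
    """class_map: list of source-class index tensors, one per target command
    (hypothetical stand-in for the paper's similarity-based mapping)."""
    logits = frozen_model(reprog(wave))  # logits over the source classes
    # Aggregate the mapped source-class logits for each target command.
    mapped = torch.stack([logits[:, idxs].mean(dim=1) for idxs in class_map], dim=1)
    loss = F.cross_entropy(mapped, target_labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```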
arXiv Detail & Related papers (2021-10-08T05:07:35Z)
- Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition [45.858039215825656]
We propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input.
Experiments are conducted on the benchmark dataset AISHELL-2 and on two large-scale Mandarin speech corpora of 5,000 hours and 21,000 hours.
arXiv Detail & Related papers (2021-06-08T12:12:33Z)
- Feature Replacement and Combination for Hybrid ASR Systems [47.74348197215634]
We investigate the usefulness of one of these front-end frameworks, namely wav2vec, for hybrid ASR systems.
In addition to deploying a pre-trained feature extractor, we explore how to make use of an existing acoustic model (AM) trained on the same task with different features.
We obtain a relative improvement of 4% and 6% over our previous best model on the LibriSpeech test-clean and test-other sets.
arXiv Detail & Related papers (2021-04-09T11:04:58Z)
- Improving noise robust automatic speech recognition with single-channel time-domain enhancement network [100.1041336974175]
We show that a single-channel time-domain denoising approach can still significantly improve ASR performance.
arXiv Detail & Related papers (2020-03-09T09:36:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.