Scenario Aware Speech Recognition: Advancements for Apollo Fearless Steps & CHiME-4 Corpora
- URL: http://arxiv.org/abs/2109.11086v1
- Date: Thu, 23 Sep 2021 00:43:32 GMT
- Title: Scenario Aware Speech Recognition: Advancements for Apollo Fearless Steps & CHiME-4 Corpora
- Authors: Szu-Jui Chen, Wei Xia, John H.L. Hansen
- Abstract summary: We consider a general non-semantic speech representation, called TRILL, which is trained with a self-supervised criterion based on triplet loss.
We observe +5.42% and +3.18% relative WER improvements for the development and evaluation sets of Fearless Steps.
- Score: 70.46867541361982
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we investigate triplet loss as the basis of an
alternative feature representation for ASR. We consider a general non-semantic
speech representation, which is trained with a self-supervised criterion based
on triplet loss and is called TRILL, for acoustic modeling to represent the acoustic
characteristics of each audio. This strategy is then applied to the CHiME-4
corpus and CRSS-UTDallas Fearless Steps Corpus, with emphasis on the 100-hour
challenge corpus which consists of 5 selected NASA Apollo-11 channels. An
analysis of the extracted embeddings provides the foundation needed to
characterize training utterances into distinct groups based on acoustic
distinguishing properties. We also demonstrate that triplet-loss based
embeddings perform better than i-Vectors in acoustic modeling, confirming
that a triplet-loss embedding is more effective than a speaker-identity feature. With additional
techniques such as pronunciation and silence probability modeling, plus
multi-style training, we achieve a +5.42% and +3.18% relative WER improvement
for the development and evaluation sets of the Fearless Steps Corpus. To
explore generalization, we further test the same technique on the 1 channel
track of CHiME-4 and observe a +11.90% relative WER improvement for real test
data.
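To make the ideas in the abstract concrete, the sketch below shows (a) the standard triplet-loss computation that underlies TRILL-style training, (b) a simple way to group utterance-level embeddings into acoustic clusters, and (c) the relative-WER arithmetic behind figures such as +5.42%. This is a minimal illustration under assumed choices (Euclidean distance, a unit margin, k-means with five clusters, random placeholder embeddings), not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the positive toward the anchor and push
    the negative at least `margin` farther away (Euclidean distance assumed)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def relative_wer_improvement(wer_baseline, wer_new):
    """Relative WER improvement in percent (positive means the new system is better)."""
    return 100.0 * (wer_baseline - wer_new) / wer_baseline

# Hypothetical utterance-level embeddings (e.g. mean-pooled TRILL vectors),
# shape (num_utterances, embedding_dim); random placeholders for illustration.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512))

# Characterize training utterances into acoustic groups; the cluster count
# is an illustrative choice, not a value taken from the paper.
groups = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embeddings)
print(groups[:10])
```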
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- Reassessing Noise Augmentation Methods in the Context of Adversarial Speech [12.488332326259469]
We investigate if noise-augmented training can concurrently improve adversarial robustness in automatic speech recognition systems.
The results demonstrate that noise augmentation not only improves model performance on noisy speech but also the model's robustness to adversarial attacks.
arXiv Detail & Related papers (2024-09-03T11:51:10Z)
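A minimal sketch of the additive noise augmentation discussed in the entry above: mixing a noise recording into clean speech at a randomly chosen signal-to-noise ratio. The SNR range and scaling scheme are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng=None, eps=1e-12):
    """Add `noise` to `speech` at the requested SNR (in dB).
    Both inputs are 1-D float arrays at the same sample rate."""
    if rng is None:
        rng = np.random.default_rng()
    # Tile or crop the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    start = rng.integers(0, len(noise) - len(speech) + 1)
    noise = noise[start:start + len(speech)]

    speech_power = np.mean(speech ** 2) + eps
    noise_power = np.mean(noise ** 2) + eps
    # Scale the noise so the resulting mixture has the requested SNR.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: augment with a random SNR drawn from [0, 20] dB (illustrative range).
rng = np.random.default_rng(0)
speech = rng.normal(size=16000)   # placeholder 1-second waveform at 16 kHz
noise = rng.normal(size=16000)
noisy = mix_at_snr(speech, noise, snr_db=rng.uniform(0, 20), rng=rng)
```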
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
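A rough PyTorch sketch of cross-attention fusion between audio and visual encoder features, as described in the entry above: the audio stream attends to the visual stream and the result is added back residually at several encoder depths. Layer count, dimensions, and the residual formulation are assumptions for illustration; the actual MLCA-AVSR architecture may differ.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse a visual feature stream into an audio feature stream with
    multi-head cross-attention (audio queries, visual keys/values)."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, T_audio, dim), visual_feats: (batch, T_visual, dim)
        fused, _ = self.attn(query=audio_feats, key=visual_feats, value=visual_feats)
        return self.norm(audio_feats + fused)   # residual connection

# Applying the fusion at several encoder depths ("multi-layer" fusion, illustrative).
fusers = nn.ModuleList([CrossAttentionFusion() for _ in range(3)])
audio = torch.randn(2, 100, 256)    # dummy audio encoder outputs
visual = torch.randn(2, 25, 256)    # dummy visual encoder outputs
for fuse in fusers:
    audio = fuse(audio, visual)     # in practice each layer fuses its own features
```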
- D4AM: A General Denoising Framework for Downstream Acoustic Models [45.04967351760919]
Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems.
Existing training objectives of SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems.
We propose a general denoising framework, D4AM, for various downstream acoustic models.
arXiv Detail & Related papers (2023-11-28T08:27:27Z)
- Low-complexity deep learning frameworks for acoustic scene classification [64.22762153453175]
We present low-complexity deep learning frameworks for acoustic scene classification (ASC).
The proposed frameworks can be separated into four main steps: Front-end spectrogram extraction, online data augmentation, back-end classification, and late fusion of predicted probabilities.
Our experiments on the DCASE 2022 Task 1 Development dataset fulfill the low-complexity requirement and achieve the best classification accuracy of 60.1%.
arXiv Detail & Related papers (2022-06-13T11:41:39Z)
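The late-fusion step mentioned in the entry above can be as simple as averaging the class-probability outputs of several back-end classifiers. The sketch below assumes a plain (unweighted) mean, which may differ from the fusion rule actually used in the paper.

```python
import numpy as np

def late_fusion(prob_matrices):
    """Average per-model class probabilities.
    `prob_matrices` is a list of arrays, each of shape (num_clips, num_classes)."""
    stacked = np.stack(prob_matrices, axis=0)   # (num_models, num_clips, num_classes)
    return stacked.mean(axis=0)                 # fused probabilities per clip

# Illustrative example with three models and ten acoustic-scene classes.
rng = np.random.default_rng(0)
preds = [rng.dirichlet(np.ones(10), size=4) for _ in range(3)]
fused = late_fusion(preds)
print(fused.argmax(axis=1))   # fused scene prediction for each clip
```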
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Evaluation of Deep-Learning-Based Voice Activity Detectors and Room Impulse Response Models in Reverberant Environments [13.558688470594676]
State-of-the-art deep-learning-based voice activity detectors (VADs) are often trained with anechoic data.
We simulate an augmented training set that contains nearly five million utterances.
We consider five different models to generate RIRs, and five different VADs that are trained with the augmented training set.
arXiv Detail & Related papers (2021-06-25T09:05:38Z)
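A minimal sketch of the augmentation idea in the entry above: convolving anechoic speech with a room impulse response (RIR) to simulate reverberant training data. The normalization and truncation to the original length are assumptions; the paper compares several RIR generation models, which are not reproduced here.

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(speech, rir):
    """Convolve anechoic speech with an RIR and rescale so the reverberant
    signal keeps roughly the original peak level."""
    wet = fftconvolve(speech, rir, mode="full")[: len(speech)]
    peak = np.max(np.abs(wet)) + 1e-12
    return wet * (np.max(np.abs(speech)) / peak)

# Placeholder signals: 1 s of "speech" and a decaying synthetic RIR at 16 kHz.
rng = np.random.default_rng(0)
speech = rng.normal(size=16000)
rir = np.exp(-np.linspace(0, 8, 4000)) * rng.normal(size=4000)
reverberant = reverberate(speech, rir)
```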
- Feature Replacement and Combination for Hybrid ASR Systems [47.74348197215634]
We investigate the usefulness of the wav2vec pre-trained front-end framework for hybrid ASR systems.
In addition to deploying a pre-trained feature extractor, we explore how to make use of an existing acoustic model (AM) trained on the same task with different features.
We obtain relative improvements of 4% and 6% over our previous best model on the LibriSpeech test-clean and test-other sets.
arXiv Detail & Related papers (2021-04-09T11:04:58Z)
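One straightforward way to combine a pre-trained front end with an existing acoustic model's features, as discussed in the entry above, is frame-wise concatenation after aligning the two streams to a common frame rate. The sketch below assumes both feature streams are already extracted and simply length-aligns and concatenates them; the paper explores several replacement and combination schemes that this does not reproduce.

```python
import numpy as np

def combine_features(wav2vec_feats, am_feats):
    """Frame-wise concatenation of two feature streams.
    Both arrays are (num_frames, dim); the frame rates are assumed to have
    been matched beforehand (e.g. by repeating or subsampling one stream)."""
    n = min(len(wav2vec_feats), len(am_feats))   # trim small length mismatches
    return np.concatenate([wav2vec_feats[:n], am_feats[:n]], axis=1)

# Illustrative shapes: 512-dim wav2vec frames and 40-dim filterbank frames.
rng = np.random.default_rng(0)
combined = combine_features(rng.normal(size=(98, 512)), rng.normal(size=(100, 40)))
print(combined.shape)   # (98, 552)
```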