Scenario Aware Speech Recognition: Advancements for Apollo Fearless Steps & CHiME-4 Corpora
- URL: http://arxiv.org/abs/2109.11086v1
- Date: Thu, 23 Sep 2021 00:43:32 GMT
- Title: Scenario Aware Speech Recognition: Advancements for Apollo Fearless Steps & CHiME-4 Corpora
- Authors: Szu-Jui Chen, Wei Xia, John H.L. Hansen
- Abstract summary: We consider a general non-semantic speech representation, TRILL, which is trained with a self-supervised criterion based on triplet loss.
We observe +5.42% and +3.18% relative WER improvement for the development and evaluation sets of Fearless Steps.
- Score: 70.46867541361982
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we propose to investigate triplet loss as an alternative feature representation for ASR. We consider a general non-semantic speech representation, TRILL, which is trained with a self-supervised criterion based on triplet loss, and use it for acoustic modeling to represent the acoustic characteristics of each audio segment. This strategy is then applied to the CHiME-4 corpus and the CRSS-UTDallas Fearless Steps Corpus, with emphasis on the 100-hour challenge corpus, which consists of five selected NASA Apollo-11 channels. An analysis of the extracted embeddings provides the foundation needed to characterize training utterances into distinct groups based on distinguishing acoustic properties. Moreover, we demonstrate that the triplet-loss-based embedding performs better than i-Vectors in acoustic modeling, confirming that the triplet loss is more effective than a speaker feature. With additional techniques such as pronunciation and silence probability modeling, plus multi-style training, we achieve +5.42% and +3.18% relative WER improvements on the development and evaluation sets of the Fearless Steps Corpus. To explore generalization, we further test the same technique on the 1-channel track of CHiME-4 and observe a +11.90% relative WER improvement on real test data.
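The paper itself includes no code; as a rough illustration of the triplet criterion that TRILL-style embeddings are trained with, the sketch below computes a hinge-style triplet loss over embedding vectors. The margin value, the 512-dimensional embeddings, and the use of squared Euclidean distance are assumptions for illustration only, not details taken from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge-style triplet loss: pull the anchor toward the positive
    (e.g. a temporally nearby segment) and push it away from the
    negative (a segment from a different context)."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Toy usage with random vectors standing in for TRILL-style embeddings.
rng = np.random.default_rng(0)
anchor, positive, negative = (rng.standard_normal(512) for _ in range(3))
print(triplet_loss(anchor, positive, negative))
```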
Related papers
- Who Can Withstand Chat-Audio Attacks? An Evaluation Benchmark for Large Language Models [70.99768410765502]
Adversarial audio attacks pose a significant threat to the growing use of large language models (LLMs) in voice-based human-machine interactions.
We introduce the Chat-Audio Attacks benchmark including four distinct types of audio attacks.
We evaluate six state-of-the-art LLMs with voice interaction capabilities, including Gemini-1.5-Pro, GPT-4o, and others.
arXiv Detail & Related papers (2024-11-22T10:30:48Z)
- Reassessing Noise Augmentation Methods in the Context of Adversarial Speech [12.488332326259469]
We investigate whether noise-augmented training can concurrently improve adversarial robustness in automatic speech recognition systems.
The results demonstrate that noise augmentation not only improves model performance on noisy speech but also strengthens the model's robustness to adversarial attacks.
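As a generic illustration of this kind of noise-augmented training data generation (not the specific recipe of that paper), the sketch below mixes a noise signal into clean speech at a chosen signal-to-noise ratio; the sample rate and signal lengths are illustrative assumptions.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into clean speech at a target SNR (dB).
    Both inputs are 1-D float arrays of equal length."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target_noise_power / noise_power)

# Toy usage: white noise mixed into a 220 Hz tone at 10 dB SNR.
t = np.arange(16000) / 16000.0
clean = 0.1 * np.sin(2 * np.pi * 220 * t)
noisy = add_noise_at_snr(clean, np.random.randn(16000), snr_db=10)
```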
arXiv Detail & Related papers (2024-09-03T11:51:10Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
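A hedged sketch of the general idea of cross-attention fusion between audio and visual streams follows; the module name, dimensions, and the residual/normalisation choices are assumptions for illustration, not that paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """One cross-attention block: audio frames attend to visual frames,
    and the attended context is added back to the audio stream."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        context, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + context)

# Toy usage: batch of 2 clips, 50 audio frames, 25 video frames, 256-dim features.
fusion = CrossModalFusion()
fused = fusion(torch.randn(2, 50, 256), torch.randn(2, 25, 256))
print(fused.shape)  # torch.Size([2, 50, 256])
```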
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- D4AM: A General Denoising Framework for Downstream Acoustic Models [45.04967351760919]
Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems.
Existing training objectives of SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems.
We propose a general denoising framework, D4AM, for various downstream acoustic models.
arXiv Detail & Related papers (2023-11-28T08:27:27Z)
- Low-complexity deep learning frameworks for acoustic scene classification [64.22762153453175]
We present low-complexity deep learning frameworks for acoustic scene classification (ASC).
The proposed frameworks can be separated into four main steps: Front-end spectrogram extraction, online data augmentation, back-end classification, and late fusion of predicted probabilities.
Our experiments conducted on the DCASE 2022 Task 1 Development dataset have fulfilled the low-complexity requirement and achieved a best classification accuracy of 60.1%.
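As a minimal sketch of the late-fusion step only (the weighted average here is an assumption; that paper's exact fusion rule may differ):

```python
import numpy as np

def late_fusion(prob_list, weights=None):
    """Fuse per-model class-probability matrices of shape
    [n_clips, n_classes] by a (weighted) average over models."""
    probs = np.stack(prob_list)                       # [n_models, n_clips, n_classes]
    if weights is None:
        weights = np.full(len(prob_list), 1.0 / len(prob_list))
    fused = np.tensordot(weights, probs, axes=1)      # average over the model axis
    return fused / fused.sum(axis=-1, keepdims=True)  # renormalise each row

# Toy usage: fuse two classifiers over 3 clips and 10 scene classes.
p1 = np.random.dirichlet(np.ones(10), size=3)
p2 = np.random.dirichlet(np.ones(10), size=3)
scene_ids = late_fusion([p1, p2]).argmax(axis=-1)
```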
arXiv Detail & Related papers (2022-06-13T11:41:39Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Evaluation of Deep-Learning-Based Voice Activity Detectors and Room Impulse Response Models in Reverberant Environments [13.558688470594676]
State-of-the-art deep-learning-based voice activity detectors (VADs) are often trained with anechoic data.
We simulate an augmented training set that contains nearly five million utterances.
We consider five different models to generate RIRs, and five different VADs that are trained with the augmented training set.
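A minimal sketch of the RIR-based augmentation idea follows; the synthetic exponential-decay RIR below is a stand-in assumption, whereas that paper evaluates dedicated RIR generation models.

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(speech, rir):
    """Convolve anechoic speech with a room impulse response and
    rescale so the reverberant copy keeps the original peak level."""
    wet = fftconvolve(speech, rir, mode="full")[: len(speech)]
    return wet * (np.max(np.abs(speech)) / (np.max(np.abs(wet)) + 1e-12))

# Toy usage: a crude exponentially decaying RIR applied to a 300 Hz tone.
sr = 16000
speech = 0.1 * np.sin(2 * np.pi * 300 * np.arange(sr) / sr)
decay = np.exp(-np.arange(int(0.3 * sr)) / (0.05 * sr))
rir = decay * np.random.randn(int(0.3 * sr))
reverberant = reverberate(speech, rir)
```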
arXiv Detail & Related papers (2021-06-25T09:05:38Z)
- Feature Replacement and Combination for Hybrid ASR Systems [47.74348197215634]
We investigate the usefulness of one of these front-end frameworks, namely wav2vec, for hybrid ASR systems.
In addition to deploying a pre-trained feature extractor, we explore how to make use of an existing acoustic model (AM) trained on the same task with different features.
We obtain a relative improvement of 4% and 6% over our previous best model on the LibriSpeech test-clean and test-other sets.
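As a rough sketch of the feature-combination idea (frame-wise concatenation is only one of several possible combination schemes; the feature dimensions below are illustrative assumptions):

```python
import numpy as np

def combine_features(wav2vec_feats, baseline_feats):
    """Frame-wise concatenation of a pre-trained wav2vec feature stream
    with conventional acoustic features, trimmed to a common length."""
    n = min(len(wav2vec_feats), len(baseline_feats))
    return np.concatenate([wav2vec_feats[:n], baseline_feats[:n]], axis=-1)

# Toy usage: 100 frames of 512-dim wav2vec features + 40-dim filterbanks.
combined = combine_features(np.random.randn(100, 512), np.random.randn(100, 40))
print(combined.shape)  # (100, 552)
```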
arXiv Detail & Related papers (2021-04-09T11:04:58Z)