Related papers: A Comparison of Speech Data Augmentation Methods Using S3PRL Toolkit

URL: http://arxiv.org/abs/2303.00510v2
Date: Fri, 29 Mar 2024 08:01:32 GMT
Title: A Comparison of Speech Data Augmentation Methods Using S3PRL Toolkit
Authors: Mina Huh, Ruchira Ray, Corey Karnei
Abstract summary: We show that SpecAugment slightly improves the performance of HuBERT and wav2vec on the original dataset. Also, we show that models trained using the Gaussian Noise and Speed Perturbation dataset are more robust when tested with augmented test sets.
Score: 2.969929079464237
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data augmentations are known to improve robustness in speech-processing tasks. In this study, we summarize and compare different data augmentation strategies using S3PRL toolkit. We explore how HuBERT and wav2vec perform using different augmentation techniques (SpecAugment, Gaussian Noise, Speed Perturbation) for Phoneme Recognition (PR) and Automatic Speech Recognition (ASR) tasks. We evaluate model performance in terms of phoneme error rate (PER) and word error rate (WER). From the experiments, we observed that SpecAugment slightly improves the performance of HuBERT and wav2vec on the original dataset. Also, we show that models trained using the Gaussian Noise and Speed Perturbation dataset are more robust when tested with augmented test sets.
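As a rough illustration of two ideas named in the abstract, the sketch below adds Gaussian noise to a waveform and computes an edit-distance error rate, which gives WER when the tokens are words and PER when they are phonemes. It is a minimal sketch under stated assumptions, not the S3PRL implementation or the authors' code: the helper names `add_gaussian_noise` and `error_rate`, the `snr_db` parameter, and the choice to set the noise level by signal-to-noise ratio are all assumptions made here for illustration.

```python
import math
import random


def add_gaussian_noise(samples, snr_db, seed=0):
    """Add white Gaussian noise at a target SNR in dB (hypothetical helper).

    Assumes `samples` is a non-empty list of floats with non-zero power.
    """
    rng = random.Random(seed)
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def error_rate(reference, hypothesis):
    """Levenshtein distance between token lists divided by reference length.

    With word tokens this is WER; with phoneme tokens it is PER.
    """
    if not reference:
        raise ValueError("reference must contain at least one token")
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# Usage: a 440 Hz tone at 16 kHz, corrupted at 10 dB SNR.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(8000)]
noisy = add_gaussian_noise(clean, snr_db=10)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = error_rate("the cat sat on the mat".split(), "the cat sit on mat".split())
```

The error rate can exceed 1.0 when the hypothesis contains many insertions, which is standard behaviour for WER and PER.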

Related papers

An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR [12.197936305117407]
Augmenting the training data of automatic speech recognition systems with synthetic data generated by text-to-speech (TTS) or voice conversion (VC) has gained popularity in recent years. We leverage recently proposed flow-based TTS/VC models allowing greater speech diversity, and assess the respective impact of augmenting various speech attributes on the word error rate (WER) achieved by several ASR models.
arXiv Detail & Related papers (2025-03-11T23:09:06Z)
TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation [19.126525226518975]
We propose a speech separation model with significantly reduced parameters and computational costs. TIGER leverages prior knowledge to divide frequency bands and compresses frequency information. We show that TIGER achieves performance surpassing state-of-the-art (SOTA) model TF-GridNet.
arXiv Detail & Related papers (2024-10-02T12:21:06Z)
BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition [72.51848069125822]
We propose BRAVEn, an extension to the RAVEn method, which learns speech representations entirely from raw audio-visual data. Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods. Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.
arXiv Detail & Related papers (2024-04-02T16:48:20Z)
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size. We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions. The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
arXiv Detail & Related papers (2023-03-25T00:37:34Z)
Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities. Our experiments show that using audio and visual modalities allows better recognition of speech in the presence of environmental noise and significantly accelerates training, reaching lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
Analyzing Robustness of End-to-End Neural Models for Automatic Speech Recognition [11.489161072526677]
We investigate robustness properties of pre-trained neural models for automatic speech recognition. In this work, we perform a robustness analysis of the pre-trained neural models wav2vec2, HuBERT and DistilHuBERT on the LibriSpeech and TIMIT datasets.
arXiv Detail & Related papers (2022-08-17T20:00:54Z)
Representative Subset Selection for Efficient Fine-Tuning in Self-Supervised Speech Recognition [6.450618373898492]
We consider the task of identifying an optimal subset of data for efficient fine-tuning in self-supervised speech models for ASR. We present the COWERAGE algorithm for representative subset selection in self-supervised ASR.
arXiv Detail & Related papers (2022-03-18T10:12:24Z)
Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL). ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers. Experiments on the LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR). We apply MixSpeech to two popular end-to-end speech recognition models, LAS (Listen, Attend and Spell) and Transformer. Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation.
arXiv Detail & Related papers (2021-02-25T03:40:43Z)
A Comparison of Discrete Latent Variable Models for Speech Representation Learning [46.52258734975676]
This paper presents a comparison of two different approaches which are broadly based on predicting future time-steps or auto-encoding the input signal. Results show that future time-step prediction with vq-wav2vec achieves better performance.
arXiv Detail & Related papers (2020-10-24T01:22:14Z)
You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model. Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.