Data Augmenting Contrastive Learning of Speech Representations in the
Time Domain
- URL: http://arxiv.org/abs/2007.00991v1
- Date: Thu, 2 Jul 2020 09:59:51 GMT
- Title: Data Augmenting Contrastive Learning of Speech Representations in the
Time Domain
- Authors: Eugene Kharitonov and Morgane Rivière and Gabriel Synnaeve and Lior
Wolf and Pierre-Emmanuel Mazaré and Matthijs Douze and Emmanuel Dupoux
- Abstract summary: We introduce WavAugment, a time-domain data augmentation library.
We find that a combination of pitch modification, additive noise and reverberation substantially increases the performance of CPC.
We also show that time-domain data augmentation consistently improves downstream limited-supervision phoneme classification tasks by 12-15% relative.
- Score: 92.50459322938528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Predictive Coding (CPC), which predicts future segments of
speech from past segments, is emerging as a powerful algorithm for
representation learning of the speech signal. However, it still under-performs
other methods on unsupervised evaluation benchmarks. Here, we introduce
WavAugment, a time-domain data augmentation library, and find that applying
augmentation in the past is generally more efficient and yields better
performance than other methods. We find that a combination of pitch
modification, additive noise and reverberation substantially increases the
performance of CPC (relative improvement of 18-22%), beating the reference
Libri-light results with 600 times less data. Using an out-of-domain dataset,
time-domain data augmentation can push CPC to be on par with the state of the
art on the Zero Speech Benchmark 2017. We also show that time-domain data
augmentation consistently improves downstream limited-supervision phoneme
classification tasks by 12-15% relative.
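As a rough illustration of the augmentation chain described in the abstract (pitch modification, additive noise, reverberation), the sketch below uses torchaudio's SoX-effects interface rather than the WavAugment API itself; the effect parameters and the 15 dB SNR are illustrative assumptions, not the values used in the paper's experiments.

```python
import torch
import torchaudio


def augment_past(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Pitch shift + reverberation (via SoX effects) followed by additive noise.

    `waveform` is a mono (1, time) tensor; the parameter values below are
    illustrative, not those used in the WavAugment/CPC experiments.
    """
    effects = [
        ["pitch", "100"],            # shift pitch by +100 cents
        ["rate", str(sample_rate)],  # restore the original sample rate
        ["reverb", "50"],            # moderate reverberance
        ["channels", "1"],           # reverb may add a channel; fold back to mono
    ]
    augmented, _ = torchaudio.sox_effects.apply_effects_tensor(
        waveform, sample_rate, effects
    )
    # Additive noise at roughly 15 dB SNR (illustrative value).
    noise = torch.randn_like(augmented)
    snr = 10 ** (15.0 / 10)
    scale = torch.sqrt(augmented.pow(2).mean() / (noise.pow(2).mean() * snr))
    return augmented + scale * noise
```

In the CPC setting described above, the abstract reports that applying augmentation in the past (the segments feeding the context network) is generally the most effective placement; the future segments used as prediction targets would then be left clean.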
Related papers
- Data Augmentation for Traffic Classification [54.92823760790628]
Data Augmentation (DA) is a technique widely adopted in Computer Vision (CV) and Natural Language Processing (NLP) tasks.
However, DA has struggled to gain traction in networking contexts, particularly in Traffic Classification (TC) tasks.
arXiv Detail & Related papers (2024-01-19T15:25:09Z) - Time Series Contrastive Learning with Information-Aware Augmentations [57.45139904366001]
A key component of contrastive learning is selecting appropriate augmentations that impose priors in order to construct feasible positive samples.
How to find the desired augmentations of time series data that are meaningful for given contrastive learning tasks and datasets remains an open question.
We propose a new contrastive learning approach with information-aware augmentations, InfoTS, that adaptively selects optimal augmentations for time series representation learning.
arXiv Detail & Related papers (2023-03-21T15:02:50Z) - Fine-tuning Strategies for Faster Inference using Speech Self-Supervised
Models: A Comparative Study [25.58608455210458]
Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings.
This article explores different approaches that may be deployed during the fine-tuning to reduce the computations needed in the SSL encoder.
arXiv Detail & Related papers (2023-03-12T19:52:34Z) - Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z) - Training Strategies for Improved Lip-reading [61.661446956793604]
We investigate the performance of state-of-the-art data augmentation approaches, temporal models and other training strategies.
A combination of all the methods results in a classification accuracy of 93.4%, which is an absolute improvement of 4.6% over the current state-of-the-art performance.
An error analysis of the various training strategies reveals that the performance improves by increasing the classification accuracy of hard-to-recognise words.
arXiv Detail & Related papers (2022-09-03T09:38:11Z) - Data Augmentation based Consistency Contrastive Pre-training for
Automatic Speech Recognition [18.303072203996347]
Self-supervised acoustic pre-training has achieved amazing results on the automatic speech recognition (ASR) task.
Most of the successful acoustic pre-training methods use contrastive learning to learn the acoustic representations.
In this letter, we design a novel consistency contrastive learning (CCL) method by utilizing data augmentation for acoustic pre-training.
arXiv Detail & Related papers (2021-12-23T13:23:17Z) - ImportantAug: a data augmentation agent for speech [10.453223310129408]
We introduce ImportantAug, a technique to augment training data for speech classification and recognition models.
Importance is predicted for each utterance by a data augmentation agent that is trained to maximize the amount of noise it adds.
arXiv Detail & Related papers (2021-12-14T04:37:04Z) - Improving RNN-T ASR Performance with Date-Time and Location Awareness [6.308539010172309]
We show that contextual information, when used individually, improves overall performance by as much as 3.48% relative to the baseline.
On specific domains, these contextual signals show improvements as high as 11.5%, without any significant degradation on others.
Our results indicate that with limited data to train the ASR model, contextual signals can improve the performance significantly.
arXiv Detail & Related papers (2021-06-11T05:57:30Z) - Improving low-resource ASR performance with untranscribed out-of-domain
data [8.376091455761259]
Semi-supervised training (SST) is a common approach to leverage untranscribed/unlabeled speech data.
We look to improve performance on conversational/telephony speech (target domain) using web resources.
arXiv Detail & Related papers (2021-06-02T15:23:34Z) - Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise, including stationary and non-stationary noises; a rough sketch of such an encoder-decoder follows below.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
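As a rough sketch of the kind of waveform-in, waveform-out encoder-decoder with skip connections mentioned in the last entry above (not the paper's actual architecture: layer sizes are arbitrary, and the causal padding and recurrent bottleneck of a real-time model are omitted):

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyWaveEnhancer(nn.Module):
    """Minimal U-Net-style encoder-decoder operating directly on raw waveforms."""

    def __init__(self, hidden: int = 16, depth: int = 3, kernel: int = 8, stride: int = 4):
        super().__init__()
        self.kernel, self.stride, self.depth = kernel, stride, depth
        self.encoder, self.decoder = nn.ModuleList(), nn.ModuleList()
        ch_in = 1
        for i in range(depth):
            ch_out = hidden * (2 ** i)
            self.encoder.append(
                nn.Sequential(nn.Conv1d(ch_in, ch_out, kernel, stride), nn.ReLU())
            )
            # Built in reverse so decoder[i] mirrors encoder[depth - 1 - i].
            self.decoder.insert(0, nn.ConvTranspose1d(ch_out, ch_in, kernel, stride))
            ch_in = ch_out

    def valid_length(self, length: int) -> int:
        # Smallest length >= `length` that the strided convolutions invert exactly.
        for _ in range(self.depth):
            length = max(math.ceil((length - self.kernel) / self.stride) + 1, 1)
        for _ in range(self.depth):
            length = (length - 1) * self.stride + self.kernel
        return length

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, time) raw waveform; the output has the same shape.
        orig_len = x.shape[-1]
        x = F.pad(x, (0, self.valid_length(orig_len) - orig_len))
        skips = []
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        # (A real-time model would typically insert a recurrent bottleneck here.)
        for dec in self.decoder:
            x = dec(x + skips.pop())  # skip connection from the matching encoder level
        return x[..., :orig_len]
```

Trained with a waveform-domain loss (e.g. L1 between the output and the clean target), such a model maps a noisy waveform directly to an enhanced one; the skip connections let fine temporal detail bypass the downsampling path.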