On Scaling Contrastive Representations for Low-Resource Speech Recognition
- URL: http://arxiv.org/abs/2102.00850v1
- Date: Mon, 1 Feb 2021 13:58:02 GMT
- Title: On Scaling Contrastive Representations for Low-Resource Speech Recognition
- Authors: Lasse Borgholt, Tycho Max Sylvester Tax, Jakob Drachmann Havtorn, Lars Maaløe, Christian Igel
- Abstract summary: We train a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework.
We find performance to decrease without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor.
- Score: 12.447872366013224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in self-supervised learning through contrastive training have
shown that it is possible to learn a competitive speech recognition system with
as little as 10 minutes of labeled data. However, these systems are
computationally expensive since they require pre-training followed by
fine-tuning in a large parameter space. We explore the performance of such
systems without fine-tuning by training a state-of-the-art speech recognizer on
the fixed representations from the computationally demanding wav2vec 2.0
framework. We find performance to decrease without fine-tuning and, in the
extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor. In
addition, we find that wav2vec 2.0 representations live in a low dimensional
subspace and that decorrelating the features of the representations can
stabilize training of the automatic speech recognizer. Finally, we propose a
bidirectional extension to the original wav2vec framework that consistently
improves performance.
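As a rough illustration of the recipe the abstract describes, training a recognizer on fixed representations and decorrelating them, the sketch below extracts frozen wav2vec 2.0 features with torchaudio's public pipeline and applies PCA whitening. The paper itself builds on the fairseq checkpoints with a full ASR model on top, so the file name, layer choice, and whitening variant here are illustrative assumptions.

```python
# Sketch: extract fixed (frozen) wav2vec 2.0 representations and
# decorrelate them with PCA whitening before feeding an ASR model.
# torchaudio's public pipeline stands in for the fairseq checkpoints
# used in the paper; "sample.wav" is a placeholder file.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("sample.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():                 # no fine-tuning: weights stay fixed
    features, _ = model.extract_features(waveform)
reps = features[-1].squeeze(0)               # (time, dim) from the top transformer layer

# PCA whitening: rotate into the eigenbasis of the feature covariance
# and rescale each direction to unit variance, removing correlations.
mu = reps.mean(dim=0, keepdim=True)
centered = reps - mu
cov = centered.T @ centered / (centered.shape[0] - 1)
eigvals, eigvecs = torch.linalg.eigh(cov)
whitened = centered @ eigvecs / torch.sqrt(eigvals + 1e-5)

print(whitened.shape)  # decorrelated features for the downstream recognizer
```

In practice the whitening statistics would be estimated once over the training set and then applied to every utterance before the recognizer.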
Related papers
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) within a pre-training and fine-tuning framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin, using it to establish accurate frame-level syllable boundaries.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) that devotes the main training parameters to multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec, which builds audio-visual representations by predicting contextualized target representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
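As a rough sketch of the data2vec-style objective that AV-data2vec builds on, the snippet below regresses a student's masked-position predictions onto an EMA teacher's averaged top-K layer outputs; the model interfaces here are hypothetical placeholders, not the paper's actual API.

```python
# Sketch of a data2vec-style objective: an EMA "teacher" copy of the
# student produces contextualized layer outputs, and the student
# regresses the average of the top-K layers at masked timesteps.
import copy
import torch
import torch.nn.functional as F

def make_teacher(student):
    """The teacher starts as a frozen copy of the student."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

def ema_update(teacher, student, tau=0.999):
    """Exponential moving average of student weights into the teacher."""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(tau).add_(ps, alpha=1 - tau)

def data2vec_loss(student, teacher, inputs, mask, top_k=8):
    # Teacher targets: average of the top-K contextualized layer outputs.
    with torch.no_grad():
        layer_outputs = teacher(inputs, return_all_layers=True)  # hypothetical interface
        targets = torch.stack(layer_outputs[-top_k:]).mean(dim=0)
    preds = student(inputs, mask=mask)                           # hypothetical interface
    return F.mse_loss(preds[mask], targets[mask])
```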
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
- Robust Speaker Recognition with Transformers Using wav2vec 2.0 [7.419725234099729]
This paper investigates the use of wav2vec 2.0 deep speech representations for the speaker recognition task.
It is concluded that the Contrastive Predictive Coding pretraining scheme efficiently exploits unlabeled data; a minimal sketch of the contrastive objective follows.
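The objective behind CPC and wav2vec pretraining can be summarized in a few lines: each context vector must identify its true future latent among sampled negatives. In this simplified sketch, the shapes and the negative sampling are assumptions, not the papers' exact setup.

```python
# Minimal InfoNCE-style contrastive loss as used in CPC/wav2vec
# pretraining: classify the true future latent against distractors.
import torch
import torch.nn.functional as F

def info_nce(context, positives, negatives, temperature=0.1):
    """context, positives: (batch, dim); negatives: (batch, n_neg, dim)."""
    candidates = torch.cat([positives.unsqueeze(1), negatives], dim=1)  # (B, 1+n_neg, D)
    logits = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / temperature
    labels = torch.zeros(context.shape[0], dtype=torch.long)            # true target at index 0
    return F.cross_entropy(logits, labels)
```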
arXiv Detail & Related papers (2022-03-28T20:59:58Z)
- STC speaker recognition systems for the NIST SRE 2021 [56.05258832139496]
This paper presents a description of STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consist of a number of diverse subsystems that use deep neural networks as feature extractors.
For the video modality, our best solution combines the RetinaFace face detector with a deep ResNet face-embedding extractor trained on large face image datasets.
arXiv Detail & Related papers (2021-11-03T15:31:01Z)
- Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition [78.92428622630861]
wav2vec 2.0 can be used for speech emotion recognition (SER).
Two baseline methods, vanilla fine-tuning (V-FT) and task adaptive pretraining (TAPT) are first presented.
We show V-FT is able to outperform state-of-the-art models on the IEMOCAP dataset.
We also introduce a novel fine-tuning method termed P-TAPT, which modifies the TAPT objective to learn contextualized emotion representations.
arXiv Detail & Related papers (2021-10-12T19:55:55Z)
- Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition [32.61769580342906]
We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency.
We introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions.
arXiv Detail & Related papers (2021-09-14T17:58:09Z)
- Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings [16.829474982595837]
We propose a transfer learning method for speech emotion recognition.
We combine the output of several layers from the pre-trained model using trainable weights which are learned jointly with the downstream model.
We evaluate our proposed approaches on two standard emotion databases, IEMOCAP and RAVDESS, showing superior performance compared to results in the literature; the layer-weighting idea is sketched below.
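The layer-combination idea can be written as one trainable, softmax-normalized weight per pre-trained layer, learned jointly with the downstream model; the layer count and feature dimension in this sketch are illustrative assumptions.

```python
# Sketch: softmax-normalized trainable weights over the outputs of all
# pre-trained layers, so the downstream model learns which layers matter.
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    def __init__(self, num_layers=13):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # uniform at init

    def forward(self, layer_features):
        # layer_features: (num_layers, batch, time, dim)
        w = torch.softmax(self.weights, dim=0)
        return (w[:, None, None, None] * layer_features).sum(dim=0)

pooled = WeightedLayerSum()(torch.randn(13, 2, 50, 768))  # -> (2, 50, 768)
```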
arXiv Detail & Related papers (2021-04-08T04:31:58Z)
- Exploring wav2vec 2.0 on speaker verification and language identification [9.047596226273495]
Wav2vec 2.0 is a recently proposed self-supervised framework for speech representation learning.
In this work, we attempt to extend wav2vec 2.0 to speaker verification and language identification.
For speaker verification, we obtain a new state-of-the-art result: an Equal Error Rate (EER) of 3.61% on the VoxCeleb1 dataset.
For language identification, we obtain an EER of 12.02% on the 1-second condition and an EER of 3.47% on the full-length condition of the AP17-OLR dataset.
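For reference, EER figures like those quoted above can be computed from raw trial scores with a simple threshold sweep; this is a generic sketch, not the paper's evaluation code.

```python
# Sketch: Equal Error Rate (EER) from trial scores and binary labels,
# found at the threshold where false acceptances equal false rejections.
import numpy as np

def equal_error_rate(scores, labels):
    """scores: similarity per trial; labels: 1 = same speaker/language."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    fars, frrs = [], []
    for t in np.sort(scores):
        decisions = scores >= t
        fars.append(np.mean(decisions[labels == 0]))   # false acceptance rate
        frrs.append(np.mean(~decisions[labels == 1]))  # false rejection rate
    fars, frrs = np.array(fars), np.array(frrs)
    idx = np.argmin(np.abs(fars - frrs))               # threshold where FAR ~= FRR
    return (fars[idx] + frrs[idx]) / 2

print(equal_error_rate([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # -> 0.0
```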
arXiv Detail & Related papers (2020-12-11T08:22:23Z)
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net_At, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)