Self-supervised Learning with Random-projection Quantizer for Speech Recognition
- URL: http://arxiv.org/abs/2202.01855v1
- Date: Thu, 3 Feb 2022 21:29:04 GMT
- Title: Self-supervised Learning with Random-projection Quantizer for Speech Recognition
- Authors: Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, Yonghui Wu
- Abstract summary: We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels.
It achieves word-error-rates similar to those of previous self-supervised learning work with non-streaming models.
- Score: 51.24368930992091
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a simple and effective self-supervised learning approach for
speech recognition. The approach learns a model to predict the masked speech
signals, in the form of discrete labels generated with a random-projection
quantizer. In particular, the quantizer projects speech inputs with a randomly
initialized matrix and performs a nearest-neighbor lookup in a randomly
initialized codebook. Neither the matrix nor the codebook is updated during
self-supervised learning. Since the random-projection quantizer is not trained
and is separate from the speech recognition model, the design makes the
approach flexible and compatible with universal speech recognition
architectures. On LibriSpeech our approach achieves word-error-rates similar to
those of
previous work using self-supervised learning with non-streaming models, and
provides lower word-error-rates and latency than wav2vec 2.0 and w2v-BERT with
streaming models. On multilingual tasks the approach also provides significant
improvement over wav2vec 2.0 and w2v-BERT.
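As a rough illustration of the mechanism described above, the following Python/NumPy sketch produces discrete labels from speech feature frames with a fixed random projection and a fixed random codebook. The class name, the feature and codebook dimensions, and the normalization step are assumptions made for the example, not details taken from the paper.

import numpy as np

class RandomProjectionQuantizer:
    """Minimal sketch: frozen random projection plus frozen random codebook."""
    def __init__(self, input_dim, codebook_size, codebook_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Randomly initialized projection matrix and codebook; neither is
        # updated during self-supervised learning.
        self.projection = rng.normal(size=(input_dim, codebook_dim))
        codebook = rng.normal(size=(codebook_size, codebook_dim))
        # Unit-normalizing codebook entries is an assumption of this sketch.
        self.codebook = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)

    def __call__(self, frames):
        # frames: (num_frames, input_dim) speech features, e.g. log-mel frames.
        projected = frames @ self.projection
        projected /= np.linalg.norm(projected, axis=1, keepdims=True)
        # Nearest-neighbor lookup: index of the closest codebook vector per frame.
        dists = np.linalg.norm(projected[:, None, :] - self.codebook[None, :, :], axis=-1)
        return dists.argmin(axis=1)  # discrete labels, one per frame

# Example: labels computed for masked frames serve as the prediction targets of
# the speech recognition model during pre-training (dimensions are hypothetical).
quantizer = RandomProjectionQuantizer(input_dim=80, codebook_size=8192, codebook_dim=16)
frames = np.random.randn(100, 80)
labels = quantizer(frames)  # shape (100,), integer codes in [0, 8192)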
Related papers
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at as low as 5 Hz and 60 bps and achieves SotA segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data, finetuning self-supervised representations is a better-performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Continual-wav2vec2: an Application of Continual Learning for Self-Supervised Automatic Speech Recognition [0.23872611575805824]
We present a method for continual learning of speech representations for multiple languages using self-supervised learning (SSL).
Wav2vec models perform SSL on raw audio in a pretraining phase and then finetune on a small fraction of annotated data.
We use ideas from continual learning to transfer knowledge from a previous task to speed up pretraining a new language task.
arXiv Detail & Related papers (2021-07-26T10:39:03Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- Investigation of Speaker-adaptation methods in Transformer based ASR [8.637110868126548]
This paper explores different ways of incorporating speaker information at the encoder input while training a transformer-based model to improve its speech recognition performance.
We present speaker information in the form of a speaker embedding for each speaker.
We obtain improvements in the word error rate over the baseline through our approach of integrating speaker embeddings into the model.
arXiv Detail & Related papers (2020-08-07T16:09:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.