W2v-BERT: Combining Contrastive Learning and Masked Language Modeling
for Self-Supervised Speech Pre-Training
- URL: http://arxiv.org/abs/2108.06209v1
- Date: Sat, 7 Aug 2021 06:29:36 GMT
- Title: W2v-BERT: Combining Contrastive Learning and Masked Language Modeling
for Self-Supervised Speech Pre-Training
- Authors: Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming
Pang, Yonghui Wu
- Abstract summary: w2v-BERT is a framework that combines contrastive learning and pre-supervised speech learning.
Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models.
- Score: 49.47516627019855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by the success of masked language modeling~(MLM) in pre-training
natural language processing models, we propose w2v-BERT that explores MLM for
self-supervised speech representation learning. w2v-BERT is a framework that
combines contrastive learning and MLM, where the former trains the model to
discretize input continuous speech signals into a finite set of discriminative
speech tokens, and the latter trains the model to learn contextualized speech
representations via solving a masked prediction task consuming the discretized
tokens. In contrast to existing MLM-based speech pre-training frameworks such
as HuBERT, which relies on an iterative re-clustering and re-training process,
or vq-wav2vec, which concatenates two separately trained modules, w2v-BERT can
be optimized in an end-to-end fashion by solving the two self-supervised
tasks~(the contrastive task and MLM) simultaneously. Our experiments show that
w2v-BERT achieves competitive results compared to current state-of-the-art
pre-trained models on the LibriSpeech benchmarks when using the Libri-Light~60k
corpus as the unsupervised data. In particular, when compared to published
models such as conformer-based wav2vec~2.0 and HuBERT, our model shows~5\%
to~10\% relative WER reduction on the test-clean and test-other subsets. When
applied to the Google's Voice Search traffic dataset, w2v-BERT outperforms our
internal conformer-based wav2vec~2.0 by more than~30\% relatively.
Related papers
- SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction [65.1590372072555]
We introduce SHuBERT, a self-supervised transformer encoder that learns strong representations from American Sign Language (ASL) video content.
Inspired by the success of the HuBERT speech representation model, SHuBERT adapts masked prediction for multi-stream visual sign language input.
SHuBERT achieves state-of-the-art performance across multiple benchmarks.
arXiv Detail & Related papers (2024-11-25T03:13:08Z) - Comparing Discrete and Continuous Space LLMs for Speech Recognition [46.70297458685438]
This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR)
We classify LLMs based on their input and autoregressive feedback into continuous and discrete-space models.
We present an open-sourced achievement of a state-of-the-art Word Error Rate (WER) of 1.69% on LibriSpeech using a HuBERT encoder.
arXiv Detail & Related papers (2024-09-01T18:29:45Z) - MooER: LLM-based Speech Recognition and Translation Models from Moore Threads [13.02816167879662]
MooER is a large-scale automatic speech recognition (ASR) / automatic speech translation (AST) model of Moore Threads.
A 5000h pseudo labeled dataset containing open source and self collected speech data is used for training.
Experiments conducted on Covost2 Zh2en testset suggest that our model outperforms other open source Speech LLMs.
arXiv Detail & Related papers (2024-08-09T14:43:56Z) - VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z) - Self-supervised Learning with Random-projection Quantizer for Speech
Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels.
It achieves similar word-error-rates as previous work using self-supervised learning with non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z) - SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text
Joint Pre-Training [33.02912456062474]
We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
arXiv Detail & Related papers (2021-10-20T00:59:36Z) - ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken
Language Understanding [23.367329217151084]
We introduce a cross-modal pre-trained language model, called Speech-Text BERT (ST-BERT), to tackle end-to-end spoken language understanding tasks.
Taking phoneme posterior and subword-level text as an input, ST-BERT learns a contextualized cross-modal alignment.
Our method shows further SLU performance gain via domain-adaptive pre-training with domain-specific speech-text pair data.
arXiv Detail & Related papers (2020-10-23T10:28:20Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.