Injecting Text in Self-Supervised Speech Pretraining
- URL: http://arxiv.org/abs/2108.12226v1
- Date: Fri, 27 Aug 2021 11:36:40 GMT
- Title: Injecting Text in Self-Supervised Speech Pretraining
- Authors: Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary
Wang, Pedro Moreno
- Abstract summary: We propose to jointly learn representations during pretraining from two different modalities: speech and text.
The proposed method, tts4pretrain, complements contrastive learning in self-supervision with lexical representations derived from synthesized speech.
We demonstrate Word Error Rate (WER) reductions of 10% relative on the well-benchmarked Librispeech task.
- Score: 33.676479965610774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised pretraining for Automated Speech Recognition (ASR) has shown
varied degrees of success. In this paper, we propose to jointly learn
representations during pretraining from two different modalities: speech and
text. The proposed method, tts4pretrain, complements the power of contrastive
learning in self-supervision with linguistic/lexical representations derived
from synthesized speech, effectively learning from untranscribed speech and
unspoken text. Lexical learning in the speech encoder is enforced through an
additional sequence loss term that is coupled with contrastive loss during
pretraining. We demonstrate that this novel pretraining method yields Word
Error Rate (WER) reductions of 10% relative on the well-benchmarked
Librispeech task over a state-of-the-art baseline pretrained with wav2vec2.0
only. The proposed method also serves as an effective strategy to compensate
for the lack of transcribed speech, effectively matching the performance of
5000 hours of transcribed speech with just 100 hours of transcribed speech on
the AMI meeting transcription task. Finally, we demonstrate WER reductions of
up to 15% on an in-house Voice Search task over traditional pretraining.
Incorporating text into encoder pretraining is complementary to rescoring with
a larger or in-domain language model, resulting in an additional 6% relative
reduction in WER.
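The abstract couples a contrastive self-supervised loss on untranscribed speech with an additional sequence loss computed on speech synthesized from unspoken text. The snippet below is a minimal, hypothetical sketch of such a joint objective; the toy GRU encoder, the frame-level InfoNCE term, the use of CTC as the sequence loss, and the 0.5 loss weight are illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch: joint contrastive + sequence-loss pretraining
# (toy stand-ins; not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToySpeechEncoder(nn.Module):
    """Stand-in for a wav2vec2.0-style context encoder."""

    def __init__(self, feat_dim=80, hidden=256, vocab=32):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, hidden)      # contrastive branch
        self.ctc_head = nn.Linear(hidden, vocab)   # lexical / sequence branch

    def forward(self, feats):
        ctx, _ = self.encoder(feats)               # (B, T, H)
        return ctx


def contrastive_loss(pred, targets, temperature=0.1):
    """Simplified frame-level InfoNCE: each predicted frame should match
    its own (detached) target frame against all other frames in the batch."""
    pred = F.normalize(pred.flatten(0, 1), dim=-1)       # (B*T, H)
    tgt = F.normalize(targets.flatten(0, 1), dim=-1)     # (B*T, H)
    logits = pred @ tgt.t() / temperature                # all-vs-all similarities
    labels = torch.arange(logits.size(0))
    return F.cross_entropy(logits, labels)


model = ToySpeechEncoder()
real_feats = torch.randn(2, 50, 80)          # untranscribed speech features
tts_feats = torch.randn(2, 50, 80)           # features synthesized from unspoken text
tts_tokens = torch.randint(1, 32, (2, 12))   # token ids of that unspoken text
tts_token_lens = torch.tensor([12, 10])

# Contrastive (self-supervised) term on real, untranscribed speech.
ctx_real = model(real_feats)
loss_contrastive = contrastive_loss(model.proj(ctx_real), ctx_real.detach())

# Sequence (lexical) term on synthesized speech; CTC is used here for brevity.
ctx_tts = model(tts_feats)
log_probs = model.ctc_head(ctx_tts).log_softmax(-1).transpose(0, 1)  # (T, B, V)
input_lens = torch.full((2,), 50, dtype=torch.long)
loss_sequence = F.ctc_loss(log_probs, tts_tokens, input_lens, tts_token_lens)

# Joint pretraining objective: contrastive loss coupled with the sequence loss.
loss = loss_contrastive + 0.5 * loss_sequence
loss.backward()
```

In the actual method the contrastive branch would follow wav2vec 2.0 (masking and quantized targets) and the synthesized speech would come from a TTS front-end; both are reduced to random tensors here so that the sketch runs standalone.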
Related papers
- Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning [23.907448315388294]
We propose an alternative method to leverage transcribed speech audio as an additional training source, based on multi-task learning (MTL).
Experiments show that, compared to a baseline MTL-based method, the proposed MTL-based method reduces PER from 2.5% to 1.6% for those word types covered exclusively in transcribed speech audio.
arXiv Detail & Related papers (2024-09-15T23:00:54Z)
- End-to-End Speech Recognition Contextualization with Large Language Models [25.198480789044346]
We introduce a novel method for contextualizing speech recognition models by incorporating Large Language Models (LLMs).
We provide audio features, along with optional text tokens for context, to train the system to complete transcriptions in a decoder-only fashion.
Our empirical results demonstrate a significant improvement in performance, with a 6% WER reduction when additional textual context is provided.
arXiv Detail & Related papers (2023-09-19T20:28:57Z)
- token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text [65.04385919645395]
token2vec is a novel joint pre-training framework for unpaired speech and text based on discrete representations of speech.
Experiments show that token2vec is significantly superior to various speech-only pre-training baselines, with up to 17.7% relative WER reduction.
arXiv Detail & Related papers (2022-10-30T06:38:19Z)
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training; a toy sketch of the pseudo-language idea appears after this list.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training [33.02912456062474]
We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
arXiv Detail & Related papers (2021-10-20T00:59:36Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z)
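Relating to the Wav2Seq entry above, the pseudo-language idea (quantize speech features into discrete units, collapse repeats, and treat the result as a pseudo transcript for a self-supervised recognition task) can be sketched as follows. The random codebook, feature dimensions, and unit count are placeholders for illustration; in the paper the units would come from clustering learned representations and would typically be compressed further (e.g., with subword modeling).

```python
# Hypothetical sketch of the pseudo-language idea (toy quantizer, not the
# Wav2Seq implementation): frames -> discrete units -> deduplicated pseudo transcript.
import torch


def quantize_frames(feats, codebook):
    """Assign each frame to its nearest codebook entry (k-means-style quantizer)."""
    dists = torch.cdist(feats, codebook)   # (T, K) frame-to-centroid distances
    return dists.argmin(dim=-1)            # (T,) discrete unit ids


def collapse_repeats(units):
    """Collapse consecutive duplicate units into a compact pseudo transcript."""
    out = []
    for u in units.tolist():
        if not out or out[-1] != u:
            out.append(u)
    return out


# Toy example: 100 frames of 39-dim features, 64 pseudo "phone-like" units.
feats = torch.randn(100, 39)
codebook = torch.randn(64, 39)   # would normally come from clustering real features
units = quantize_frames(feats, codebook)
pseudo_transcript = collapse_repeats(units)
print(len(units), "frames ->", len(pseudo_transcript), "pseudo tokens")
```

The resulting pseudo transcripts can then serve as targets for a pseudo speech recognition task, either as the sole pretraining objective or as a low-cost second pretraining stage, as the entry describes.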