Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired
Speech Data
- URL: http://arxiv.org/abs/2203.17113v1
- Date: Thu, 31 Mar 2022 15:33:56 GMT
- Title: Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired
Speech Data
- Authors: Junyi Ao, Ziqiang Zhang, Long Zhou, Shujie Liu, Haizhou Li, Tom Ko,
Lirong Dai, Jinyu Li, Yao Qian, Furu Wei
- Abstract summary: We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C achieves a relative word error rate (WER) reduction of 19.2% over the method without decoder pre-training.
- Score: 145.95460945321253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies a novel pre-training technique with unpaired speech data,
Speech2C, for encoder-decoder based automatic speech recognition (ASR). Within
a multi-task learning framework, we introduce two pre-training tasks for the
encoder-decoder network using acoustic units, i.e., pseudo codes, derived from
an offline clustering model. One task predicts the pseudo codes via masked
language modeling on the encoder output, as in the HuBERT model, while the other
lets the decoder learn to reconstruct the pseudo codes autoregressively instead
of generating textual transcripts. In this way, the decoder learns to reconstruct
original speech information with codes before learning to generate correct
text. Comprehensive experiments on the LibriSpeech corpus show that the
proposed Speech2C achieves a relative word error rate (WER) reduction of 19.2% over
the method without decoder pre-training, and also significantly outperforms the
state-of-the-art wav2vec 2.0 and HuBERT on the 10h and 100h fine-tuning subsets.
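To make the two pre-training tasks concrete, here is a minimal sketch in PyTorch of how a masked pseudo-code prediction loss on the encoder output can be combined with an autoregressive pseudo-code reconstruction loss on the decoder. The module sizes, the frame-level masking scheme, and the class and head names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the two Speech2C-style pre-training losses: masked prediction of
# pseudo codes on the encoder output (HuBERT-style) plus autoregressive
# reconstruction of the same codes by the decoder. Shapes and masking are
# illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Speech2CStylePretrain(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, n_codes=500):
        super().__init__()
        self.frontend = nn.Linear(feat_dim, d_model)   # stand-in for a convolutional feature extractor
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.code_emb = nn.Embedding(n_codes + 1, d_model)   # +1 for a BOS symbol
        self.enc_head = nn.Linear(d_model, n_codes)          # masked-prediction head
        self.dec_head = nn.Linear(d_model, n_codes)          # reconstruction head
        self.bos = n_codes

    def forward(self, feats, mask, codes):
        # feats: (B, T, feat_dim); mask: (B, T) bool, True at masked frames
        # codes: (B, T) pseudo codes from an offline clustering model (e.g. k-means)
        x = self.frontend(feats)
        x = x.masked_fill(mask.unsqueeze(-1), 0.0)            # crude frame masking
        enc_out = self.encoder(x)

        # Task 1: predict the pseudo codes at the masked positions.
        loss_mlm = F.cross_entropy(self.enc_head(enc_out)[mask], codes[mask])

        # Task 2: decoder reconstructs the full code sequence autoregressively.
        dec_in = torch.cat([torch.full_like(codes[:, :1], self.bos), codes[:, :-1]], dim=1)
        T = codes.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=feats.device), diagonal=1)
        dec_out = self.decoder(self.code_emb(dec_in), enc_out, tgt_mask=causal)
        loss_ar = F.cross_entropy(self.dec_head(dec_out).transpose(1, 2), codes)
        return loss_mlm + loss_ar

if __name__ == "__main__":
    model = Speech2CStylePretrain()
    feats, mask, codes = torch.randn(2, 40, 80), torch.rand(2, 40) < 0.3, torch.randint(0, 500, (2, 40))
    print(model(feats, mask, codes).item())
```

The point of the second loss is that the decoder already sees speech-derived targets during pre-training, so it is not initialized from scratch when the model is later fine-tuned to generate text.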
Related papers
- DASpeech: Directed Acyclic Transformer for Fast and High-quality
Speech-to-Speech Translation [36.126810842258706]
Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model.
Due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution.
We propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST.
arXiv Detail & Related papers (2023-10-11T11:39:36Z) - RepCodec: A Speech Representation Codec for Speech Tokenization [21.60885344868044]
RepCodec is a novel speech representation codec for semantic speech tokenization.
We show that RepCodec significantly outperforms the widely used k-means clustering approach in both speech understanding and generation.
arXiv Detail & Related papers (2023-08-31T23:26:10Z) - UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present UnitY, a novel two-pass direct S2ST architecture that first generates a textual representation and then predicts discrete acoustic units.
We further improve performance with subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting spectrograms in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z) - Linguistic-Enhanced Transformer with CTC Embedding for Speech
- Linguistic-Enhanced Transformer with CTC Embedding for Speech Recognition [29.1423215212174]
The recent emergence of the joint CTC-Attention model has brought significant improvements to automatic speech recognition (ASR).
We propose a linguistic-enhanced transformer that introduces refined CTC information to the decoder during training.
Experiments on the AISHELL-1 speech corpus show a relative character error rate (CER) reduction of up to 7%.
arXiv Detail & Related papers (2022-10-25T08:12:59Z) - SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z) - Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transfer the text features extracted by the text encoder into a mel-spectrogram with the help of the VQ-VAE, and then the vocoder transforms the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - Self-supervised Learning with Random-projection Quantizer for Speech
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels.
It achieves word error rates similar to previous self-supervised learning work with non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)