SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language
Processing
- URL: http://arxiv.org/abs/2110.07205v1
- Date: Thu, 14 Oct 2021 07:59:27 GMT
- Title: SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language
Processing
- Authors: Junyi Ao, Rui Wang, Long Zhou, Shujie Liu, Shuo Ren, Yu Wu, Tom Ko,
Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei
- Abstract summary: We propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.
The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets.
To align the textual and speech information into a unified semantic space, we propose a cross-modal vector quantization method with random mixing-up to bridge speech and text.
- Score: 77.4527868307914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by the success of T5 (Text-To-Text Transfer Transformer) in
pre-training natural language processing models, we propose a unified-modal
SpeechT5 framework that explores the encoder-decoder pre-training for
self-supervised speech/text representation learning. The SpeechT5 framework
consists of a shared encoder-decoder network and six modal-specific
(speech/text) pre/post-nets. After preprocessing the speech/text input through
the pre-nets, the shared encoder-decoder network models the sequence to
sequence transformation, and then the post-nets generate the output in the
speech/text modality based on the decoder output. Particularly, SpeechT5 can
pre-train on a large scale of unlabeled speech and text data to improve the
capability of the speech and textual modeling. To align the textual and speech
information into a unified semantic space, we propose a cross-modal vector
quantization method with random mixing-up to bridge speech and text. Extensive
evaluations on a wide variety of spoken language processing tasks, including
voice conversion, automatic speech recognition, text to speech, and speaker
identification, show the superiority of the proposed SpeechT5 framework.
Related papers
- Scaling Speech-Text Pre-training with Synthetic Interleaved Data [31.77653849518526]
Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction.
Traditional approaches for developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data.
We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora.
arXiv Detail & Related papers (2024-11-26T17:19:09Z) - MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech
Recognition [75.12948999653338]
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR)
We employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
arXiv Detail & Related papers (2022-11-29T13:16:09Z) - SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder
Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z) - SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.