Many-to-Many Spoken Language Translation via Unified Speech and Text
Representation Learning with Unit-to-Unit Translation
- URL: http://arxiv.org/abs/2308.01831v1
- Date: Thu, 3 Aug 2023 15:47:04 GMT
- Title: Many-to-Many Spoken Language Translation via Unified Speech and Text
Representation Learning with Unit-to-Unit Translation
- Authors: Minsu Kim, Jeongsoo Choi, Dahun Kim, Yong Man Ro
- Abstract summary: We represent multilingual speech audio with speech units, the quantized representations of speech features encoded from a self-supervised speech model.
Then, we propose to train an encoder-decoder structured model with a Unit-to-Unit Translation (UTUT) objective on multilingual data.
- A single pre-trained model with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST).
- Score: 39.74625363642717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a method to learn unified representations of
multilingual speech and text with a single model, with a particular focus on
speech synthesis. We represent multilingual speech audio with speech
units, the quantized representations of speech features encoded from a
self-supervised speech model. This lets us focus on the linguistic content of
the audio by treating it as pseudo text, and thus build a unified
representation of speech and text. Then, we propose to train an encoder-decoder
structured model with a Unit-to-Unit Translation (UTUT) objective on
multilingual data. Specifically, by conditioning the encoder on a source
language token and the decoder on a target language token, the model is
optimized to translate spoken language from the source into the target
language, in a many-to-many translation setting. As a result, the model builds
knowledge of how spoken languages are comprehended and how they relate to one
another. A single pre-trained model with UTUT can be employed for
diverse multilingual speech- and text-related tasks, such as Speech-to-Speech
Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and
Text-to-Speech Translation (TTST). By conducting comprehensive experiments
encompassing various languages, we validate the efficacy of the proposed method
across diverse multilingual tasks. Moreover, we show that UTUT can perform
many-to-many language STS, which has not been previously explored in the
literature. Samples are available at https://choijeongsoo.github.io/utut.
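To make the setup concrete, here is a minimal PyTorch sketch of the pipeline the abstract describes: frame-level features from a self-supervised speech model are quantized into discrete units, and an encoder-decoder is conditioned on source- and target-language tokens. Everything specific below (the codebook size, the language-token ids, the model dimensions, and the nearest-neighbour quantizer) is an illustrative assumption, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

# Illustrative constants; the paper's settings may differ.
NUM_UNITS = 1000  # assumed k-means codebook size
LANG = {"<en>": NUM_UNITS, "<es>": NUM_UNITS + 1, "<fr>": NUM_UNITS + 2}
VOCAB = NUM_UNITS + len(LANG)

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map frame-level self-supervised features (B, T, D) to discrete speech
    units (B, T) by nearest-neighbour assignment against a k-means codebook."""
    dists = ((features.unsqueeze(-2) - codebook) ** 2).sum(dim=-1)  # (B, T, K)
    return dists.argmin(dim=-1)

class UnitToUnitTranslator(nn.Module):
    """Encoder-decoder over speech units. Prefixing the encoder input with a
    source-language token and the decoder input with a target-language token
    lets a single model cover many-to-many translation directions."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3,
                                          batch_first=True)
        self.out = nn.Linear(d_model, VOCAB)

    def forward(self, src_units, tgt_units, src_lang: str, tgt_lang: str):
        b = src_units.size(0)
        # Prepend language tokens to condition encoder (source) and decoder (target).
        src = torch.cat([torch.full((b, 1), LANG[src_lang]), src_units], dim=1)
        tgt = torch.cat([torch.full((b, 1), LANG[tgt_lang]), tgt_units], dim=1)
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(self.embed(src), self.embed(tgt), tgt_mask=mask)
        return self.out(hidden)  # logits over the unit vocabulary

# Toy usage: quantize random "features", then translate English -> Spanish units.
codebook = torch.randn(NUM_UNITS, 64)
src_units = quantize(torch.randn(2, 50, 64), codebook)
tgt_units = torch.randint(0, NUM_UNITS, (2, 40))  # teacher-forced target prefix
logits = UnitToUnitTranslator()(src_units, tgt_units, "<en>", "<es>")
print(logits.shape)  # torch.Size([2, 41, 1003])
```

Training such a model would minimize cross-entropy between these logits and the shifted target units; at inference, speech-to-speech translation additionally needs a unit-to-waveform vocoder, which is outside this sketch.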
Related papers
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes; a sketch of this masking step follows this entry.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
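The ERNIE-SAT summary above mentions randomly masking the spectrogram and the phonemes during joint pretraining. The sketch below shows one way such a masking step could look; the mask ratios, the reserved mask id, and the tensor shapes are assumptions for illustration, not the paper's settings.

```python
import torch

MASK_PHONEME_ID = 0  # assumed id reserved for masked phonemes

def mask_inputs(spec: torch.Tensor, phonemes: torch.Tensor,
                spec_ratio: float = 0.15, phon_ratio: float = 0.15):
    """Randomly zero out spectrogram frames and replace phoneme ids.

    spec:     (batch, frames, mel_bins) float spectrogram
    phonemes: (batch, length) integer phoneme ids
    Returns masked copies plus the boolean masks marking the positions
    on which a reconstruction loss would be computed.
    """
    spec_mask = torch.rand(spec.shape[:2]) < spec_ratio  # (B, T)
    phon_mask = torch.rand(phonemes.shape) < phon_ratio  # (B, L)
    masked_spec = spec.masked_fill(spec_mask.unsqueeze(-1), 0.0)
    masked_phon = phonemes.masked_fill(phon_mask, MASK_PHONEME_ID)
    return masked_spec, masked_phon, spec_mask, phon_mask

# Toy usage with random inputs.
spec = torch.randn(2, 120, 80)
phonemes = torch.randint(1, 70, (2, 30))
m_spec, m_phon, s_mask, p_mask = mask_inputs(spec, phonemes)
```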
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- Unified Speech-Text Pre-training for Speech Translation and Recognition [113.31415771943162]
We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition.
The proposed method incorporates four self-supervised and supervised subtasks for cross modality learning.
It achieves between 1.7 and 2.3 BLEU improvement above the state of the art on the MuST-C speech translation dataset.
arXiv Detail & Related papers (2022-04-11T20:59:51Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario [10.779568857641928]
This paper presents an extension of Tacotron2 to achieve bilingual multispeaker speech synthesis.
We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers.
arXiv Detail & Related papers (2020-05-21T03:03:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.