UWSpeech: Speech to Speech Translation for Unwritten Languages
- URL: http://arxiv.org/abs/2006.07926v2
- Date: Thu, 17 Dec 2020 12:39:06 GMT
- Title: UWSpeech: Speech to Speech Translation for Unwritten Languages
- Authors: Chen Zhang, Xu Tan, Yi Ren, Tao Qin, Kejun Zhang, Tie-Yan Liu
- Abstract summary: We develop a translation system for unwritten languages, named UWSpeech, which converts target unwritten speech into discrete tokens with a converter.
We propose a method called XL-VAE, which enhances the vector quantized variational autoencoder (VQ-VAE) with cross-lingual (XL) speech recognition.
Experiments on the Fisher Spanish-English conversation translation dataset show that UWSpeech outperforms the direct translation and VQ-VAE baselines by about 16 and 10 BLEU points, respectively.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing speech-to-speech translation systems rely heavily on the text of the target language: they usually either translate the source language to target text and then synthesize target speech from that text, or translate directly to target speech while using target text for auxiliary training. However, those methods cannot be applied to unwritten target languages, for which no written text or phoneme inventory is available. In this paper, we develop a translation system for unwritten languages, named UWSpeech, which converts target unwritten speech into discrete tokens with a converter, then translates source-language speech into target discrete tokens with a translator, and finally synthesizes target speech from the target discrete tokens with an inverter. We propose a method called XL-VAE, which enhances the vector quantized variational autoencoder (VQ-VAE) with cross-lingual (XL) speech recognition, to train the converter and inverter of UWSpeech jointly. Experiments on the Fisher Spanish-English conversation translation dataset show that UWSpeech outperforms the direct translation and VQ-VAE baselines by about 16 and 10 BLEU points, respectively, which demonstrates the advantages and potential of UWSpeech.
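The abstract describes a three-stage pipeline built on discrete tokens: a converter (speech to tokens), a translator (source speech to target tokens), and an inverter (tokens to speech), with the converter and inverter trained jointly via XL-VAE. Below is a minimal, runnable Python sketch of how such a pipeline composes; the class and function names, shapes, and the random codebook are illustrative assumptions, not the paper's implementation. The encode step, though, follows the standard VQ-VAE recipe: each frame is mapped to the index of its nearest codebook vector.

```python
import numpy as np

class VQCodebook:
    """VQ-VAE-style quantizer: each frame maps to its nearest codebook vector.
    In UWSpeech the codebook is learned jointly with XL-VAE; here it is random."""

    def __init__(self, num_codes: int = 256, dim: int = 80, seed: int = 0):
        self.codes = np.random.default_rng(seed).normal(size=(num_codes, dim))

    def encode(self, frames: np.ndarray) -> np.ndarray:
        # frames: (T, dim) -> token ids (T,), via argmin of squared distance.
        dists = ((frames[:, None, :] - self.codes[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        # token ids (T,) -> quantized frames (T, dim), consumed by the inverter.
        return self.codes[tokens]

def translator(source_frames: np.ndarray) -> np.ndarray:
    # Stand-in for the seq2seq translator: source speech frames -> target token ids.
    return np.zeros(source_frames.shape[0], dtype=int)  # dummy output

def inverter(quantized_frames: np.ndarray) -> np.ndarray:
    # Stand-in for the vocoder-like inverter: quantized frames -> waveform samples.
    return quantized_frames.mean(axis=1)  # dummy waveform

codebook = VQCodebook()
source = np.random.default_rng(1).normal(size=(120, 80))  # fake source speech frames
target_tokens = translator(source)                        # speech -> discrete tokens
target_speech = inverter(codebook.decode(target_tokens))  # tokens -> speech
print(target_tokens.shape, target_speech.shape)           # (120,) (120,)
```

In the paper's setup, the converter and inverter are trained jointly with XL-VAE (which adds a cross-lingual speech recognition objective), and the translator is then trained to map source speech to the converter's output tokens.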
Related papers
- TranSentence: Speech-to-speech Translation via Language-agnostic Sentence-level Speech Encoding without Language-parallel Data (2024-01-17)
  TranSentence is a speech-to-speech translation approach that requires no language-parallel speech data. We train our model to generate speech based on the encoded embedding obtained from a language-agnostic sentence-level speech encoder. We extend TranSentence to multilingual speech-to-speech translation.
- Direct Text to Speech Translation System using Acoustic Units (2023-09-14)
  This paper proposes a direct text-to-speech translation system using discrete acoustic units. The framework takes text in different source languages as input and generates speech in the target language without needing text transcriptions in that language. Results show a remarkable improvement when the proposed architecture is initialised with a model pre-trained on more languages.
- SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models (2023-08-31)
  Existing speech tokens are not specifically designed for speech language modeling. We propose SpeechTokenizer, a unified speech tokenizer for speech large language models. Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark.
- Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation (2023-08-03)
  This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech (see the sketch after this list). We demonstrate that the proposed UTUT model can be effectively utilized not only for speech-to-speech translation (S2ST) but also for multilingual text-to-speech synthesis (T2S) and text-to-speech translation (T2ST).
- AudioPaLM: A Large Language Model That Can Speak and Listen (2023-06-22)
  We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models. It can process and generate text and speech, with applications including speech recognition and speech-to-speech translation.
- Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling (2023-03-07)
  We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. VALL-E X inherits strong in-context learning capabilities and can be applied to zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. It can generate high-quality speech in the target language from just one speech utterance in the source language as a prompt, while preserving the unseen speaker's voice, emotion, and acoustic environment.
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data (2022-09-30)
  We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities. We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB.
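Several of the entries above share one underlying move: discretize speech into unit ids so that ordinary text-sequence machinery can operate on it. As a minimal, hypothetical sketch of the "speech units as pseudo-text" idea mentioned in the UTUT summary, the following Python code renders unit ids as pseudo-tokens; the <unit_k> token format is an assumption for illustration, not any paper's exact scheme.

```python
# Hypothetical rendering of discrete speech units as pseudo-text, so that a
# standard text seq2seq vocabulary and training pipeline can be reused unchanged.

def units_to_pseudo_text(unit_ids: list[int]) -> str:
    """Render discrete speech unit ids as pseudo-text tokens."""
    return " ".join(f"<unit_{u}>" for u in unit_ids)

def pseudo_text_to_units(text: str) -> list[int]:
    """Parse pseudo-text tokens back into unit ids."""
    return [int(tok[len("<unit_"):-1]) for tok in text.split()]

ids = [12, 7, 7, 310]
pseudo = units_to_pseudo_text(ids)   # "<unit_12> <unit_7> <unit_7> <unit_310>"
assert pseudo_text_to_units(pseudo) == ids
```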