TranSentence: Speech-to-speech Translation via Language-agnostic
Sentence-level Speech Encoding without Language-parallel Data
- URL: http://arxiv.org/abs/2401.12992v1
- Date: Wed, 17 Jan 2024 11:52:40 GMT
- Authors: Seung-Bin Kim, Sang-Hoon Lee, Seong-Whan Lee
- Abstract summary: TranSentence is a novel speech-to-speech translation model that requires no language-parallel speech data.
We train our model to generate speech based on the encoded embedding obtained from a language-agnostic sentence-level speech encoder.
We extend TranSentence to multilingual speech-to-speech translation.
- Score: 44.83532231917504
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Although there has been significant advancement in the field of
speech-to-speech translation, conventional models still require
language-parallel speech data between the source and target languages for
training. In this paper, we introduce TranSentence, a novel speech-to-speech
translation model that requires no language-parallel speech data. To achieve this, we first
adopt a language-agnostic sentence-level speech encoding that captures the
semantic information of speech, irrespective of language. We then train our
model to generate speech based on the encoded embedding obtained from a
language-agnostic sentence-level speech encoder that is pre-trained with
various languages. With this method, despite training exclusively on the target
language's monolingual data, we can generate target language speech in the
inference stage using language-agnostic speech embedding from the source
language speech. Furthermore, we extend TranSentence to multilingual
speech-to-speech translation. The experimental results demonstrate that
TranSentence is superior to other models.
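
The pipeline the abstract describes can be pictured as a two-stage split: a frozen language-agnostic sentence-level encoder on the input side, and a speech generator trained only on target-language monolingual data on the output side. Below is a minimal toy sketch of that split; the encoder and decoder here are placeholder stand-ins, not the paper's actual modules.

```python
# Toy sketch of the TranSentence training/inference split described in the
# abstract. Both components are stand-ins: the real system builds on a
# pre-trained multilingual sentence-level speech encoder.
import numpy as np

EMB_DIM = 512

def sentence_embed(speech: np.ndarray) -> np.ndarray:
    """Stand-in for the frozen language-agnostic sentence-level speech
    encoder: any language in, one fixed-size semantic vector out."""
    rng = np.random.default_rng(abs(hash(speech.tobytes())) % 2**32)
    return rng.standard_normal(EMB_DIM)

class ToyDecoder:
    """Stand-in speech generator, trained ONLY on target-language
    monolingual data (sentence embedding -> target speech features)."""
    def fit(self, embeddings: np.ndarray, target_feats: np.ndarray):
        # e.g. regress speech features from sentence embeddings
        self.W, *_ = np.linalg.lstsq(embeddings, target_feats, rcond=None)

    def generate(self, embedding: np.ndarray) -> np.ndarray:
        return embedding @ self.W

# --- Training: target-language monolingual data only --------------------
target_utts = [np.random.randn(16000) for _ in range(8)]   # dummy audio
embs = np.stack([sentence_embed(u) for u in target_utts])
feats = np.stack([u[:80] for u in target_utts])            # dummy features
decoder = ToyDecoder()
decoder.fit(embs, feats)

# --- Inference: source-language speech, never seen in training ----------
source_utt = np.random.randn(16000)
target_speech = decoder.generate(sentence_embed(source_utt))
```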
Related papers
- Cross-Lingual Transfer Learning for Speech Translation [7.802021866251242]
This paper examines how to expand the speech translation capability of speech foundation models with restricted data.
Whisper, a speech foundation model with strong performance on speech recognition and English translation, is used as the example model.
Using speech-to-speech retrieval to analyse the audio representations generated by the encoder, we show that utterances from different languages are mapped to a shared semantic space.
arXiv Detail & Related papers (2024-07-01T09:51:48Z)
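
The retrieval analysis mentioned above amounts to nearest-neighbour search over pooled encoder representations. A minimal sketch, with all shapes and names hypothetical:

```python
# Hypothetical sketch of speech-to-speech retrieval: mean-pool encoder
# states per utterance, then match each source utterance to its nearest
# target-language neighbour by cosine similarity.
import numpy as np

def pool(encoder_states: np.ndarray) -> np.ndarray:
    """Mean-pool (time, dim) encoder states to one unit-norm vector."""
    v = encoder_states.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

src = np.stack([pool(np.random.randn(120, 768)) for _ in range(5)])
tgt = np.stack([pool(np.random.randn(140, 768)) for _ in range(5)])

similarity = src @ tgt.T           # cosine similarity matrix
matches = similarity.argmax(axis=1)
print(matches)  # high retrieval accuracy suggests a shared semantic space
```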
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
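
The two-encoder idea above can be pictured as separate voice and timing embeddings concatenated with the translated content before decoding. The sketch below is a hypothetical illustration of that conditioning structure, not TransVIP's actual modules.

```python
# Hypothetical conditioning structure: one embedding for voice
# characteristics, one for timing/isochrony, both joined with the
# translated semantic content before synthesis.
import numpy as np

def voice_encoder(src_speech: np.ndarray) -> np.ndarray:
    return np.random.randn(192)     # stand-in speaker embedding

def isochrony_encoder(src_speech: np.ndarray) -> np.ndarray:
    return np.random.randn(32)      # stand-in pause/duration embedding

def decode(content: np.ndarray, voice: np.ndarray, timing: np.ndarray):
    cond = np.concatenate([content, voice, timing])
    return cond  # a real decoder would synthesize speech from `cond`

src = np.random.randn(16000)        # dummy source audio
content = np.random.randn(512)      # translated semantic representation
out = decode(content, voice_encoder(src), isochrony_encoder(src))
print(out.shape)                    # (736,)
```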
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
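
Treating speech units as pseudo-text means that, once speech is discretized, translation becomes an ordinary sequence-to-sequence task over a unit vocabulary, with language tokens selecting source and target. A hypothetical sketch of the data format (token names are illustrative, not UTUT's):

```python
# Hypothetical "speech units as pseudo-text" data format: discrete unit
# IDs are rendered as tokens, with a language token prepended.

def to_pseudo_text(unit_ids, lang):
    """Render discrete speech units as a token sequence for a seq2seq model."""
    return [f"<{lang}>"] + [f"u_{u}" for u in unit_ids]

# e.g. units from a Spanish utterance and its English counterpart
src_tokens = to_pseudo_text([17, 934, 2, 501], "es")
tgt_tokens = to_pseudo_text([88, 5, 412], "en")
print(src_tokens)   # ['<es>', 'u_17', 'u_934', 'u_2', 'u_501']
# A standard encoder-decoder is then trained on (src_tokens -> tgt_tokens)
# pairs for many language directions at once.
```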
- Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling [92.55131711064935]
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis.
VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.
It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z)
- Code-Switching without Switching: Language Agnostic End-to-End Speech Translation [68.8204255655161]
We treat speech recognition and translation as one unified end-to-end speech translation problem.
By training LAST with both input languages, we decode speech into one target language, regardless of the input language.
arXiv Detail & Related papers (2022-10-04T10:34:25Z)
- LibriS2S: A German-English Speech-to-Speech Translation Corpus [12.376309678270275]
We create the first publicly available speech-to-speech training corpus between German and English.
This allows the creation of new text-to-speech and speech-to-speech translation models.
We propose text-to-speech models based on the recently proposed FastSpeech 2 model.
arXiv Detail & Related papers (2022-04-22T09:33:31Z)
- UWSpeech: Speech to Speech Translation for Unwritten Languages [145.37116196042282]
We develop a translation system for unwritten languages, named UWSpeech, which converts target unwritten speech into discrete tokens with a converter.
We propose a method called XL-VAE, which enhances vector quantized variational autoencoder (VQ-VAE) with cross-lingual (XL) speech recognition.
Experiments on Fisher Spanish-English conversation translation dataset show that UWSpeech outperforms direct translation and VQ-VAE baseline by about 16 and 10 BLEU points respectively.
arXiv Detail & Related papers (2020-06-14T15:22:12Z)
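
At the core of XL-VAE is VQ-VAE-style vector quantization: encoder outputs are snapped to their nearest codebook entries to yield discrete tokens. A minimal sketch of that quantization step (XL-VAE's cross-lingual training is not modelled here):

```python
# Minimal sketch of the vector-quantization step at the heart of VQ-VAE
# (and hence XL-VAE): each latent frame is mapped to its nearest
# codebook vector, producing discrete tokens for the decoder.
import numpy as np

def vector_quantize(z: np.ndarray, codebook: np.ndarray):
    """Return the index and value of the nearest code for each frame."""
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)           # discrete tokens
    return idx, codebook[idx]

codebook = np.random.randn(512, 64)      # 512 codes, 64-dim latents
z = np.random.randn(100, 64)             # encoder output for one utterance
tokens, z_q = vector_quantize(z, codebook)
print(tokens[:10], z_q.shape)            # ten unit IDs and (100, 64)
```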
- Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario [10.779568857641928]
This paper presents an extension of Tacotron2 to achieve bilingual multispeaker speech synthesis.
We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers.
arXiv Detail & Related papers (2020-05-21T03:03:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.