TranSentence: Speech-to-speech Translation via Language-agnostic
Sentence-level Speech Encoding without Language-parallel Data
- URL: http://arxiv.org/abs/2401.12992v1
- Date: Wed, 17 Jan 2024 11:52:40 GMT
- Authors: Seung-Bin Kim, Sang-Hoon Lee, Seong-Whan Lee
- Abstract summary: TranSentence is a novel speech-to-speech translation method that requires no language-parallel speech data.
We train our model to generate speech based on the encoded embedding obtained from a language-agnostic sentence-level speech encoder.
We extend TranSentence to multilingual speech-to-speech translation.
- Score: 44.83532231917504
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Although there has been significant advancement in the field of
speech-to-speech translation, conventional models still require
language-parallel speech data between the source and target languages for
training. In this paper, we introduce TranSentence, a novel speech-to-speech
translation method that requires no language-parallel speech data. To achieve this, we first
adopt a language-agnostic sentence-level speech encoding that captures the
semantic information of speech, irrespective of language. We then train our
model to generate speech based on the encoded embedding obtained from a
language-agnostic sentence-level speech encoder that is pre-trained with
various languages. With this method, despite training exclusively on the target
language's monolingual data, we can generate target language speech in the
inference stage using language-agnostic speech embedding from the source
language speech. Furthermore, we extend TranSentence to multilingual
speech-to-speech translation. The experimental results demonstrate that
TranSentence is superior to other models.
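The training/inference asymmetry described in the abstract can be sketched as follows. This is an illustrative toy, not the paper's implementation: the language-agnostic sentence-level encoder is simulated by a fixed random projection (`PROJ`, `encode_sentence` are placeholder names) so the script runs standalone; in TranSentence this role is played by a neural encoder pre-trained on many languages, and a speech decoder (not shown) is trained on target-language monolingual data only.

```python
import numpy as np

# Stand-in for a pre-trained language-agnostic sentence-level speech
# encoder. A fixed random projection keeps the script runnable; the
# real encoder is a multilingual neural network.
rng = np.random.default_rng(0)
PROJ = rng.standard_normal((80, 512))

def encode_sentence(mel_frames: np.ndarray) -> np.ndarray:
    """Map variable-length speech (frames x 80 mel bins) to one fixed
    sentence-level embedding, irrespective of the input language."""
    pooled = mel_frames.mean(axis=0)   # temporal pooling -> (80,)
    emb = pooled @ PROJ                # project -> (512,)
    return emb / (np.linalg.norm(emb) + 1e-8)

# Training: only target-language monolingual speech is used; a decoder
# would learn to reconstruct it from its own sentence embedding.
target_speech = rng.standard_normal((120, 80))
train_embedding = encode_sentence(target_speech)

# Inference: source-language speech passes through the SAME frozen
# encoder, and the decoder generates target-language speech from the
# resulting embedding.
source_speech = rng.standard_normal((95, 80))
infer_embedding = encode_sentence(source_speech)
```

Because both languages land in the same embedding space, the decoder never needs to see source/target speech pairs during training.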
Related papers
- Cross-Lingual Transfer Learning for Speech Translation [7.802021866251242]
Zero-shot cross-lingual transfer has been demonstrated on a range of NLP tasks.
We explore whether speech-based models exhibit the same transfer capability.
arXiv Detail & Related papers (2024-07-01T09:51:48Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling [92.55131711064935]
We propose a cross-lingual neural language model, VALL-E X, for cross-lingual speech synthesis.
VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.
It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z)
- Code-Switching without Switching: Language Agnostic End-to-End Speech Translation [68.8204255655161]
We treat speech recognition and translation as one unified end-to-end speech translation problem.
By training LAST with both input languages, we decode speech into one target language, regardless of the input language.
arXiv Detail & Related papers (2022-10-04T10:34:25Z)
- LibriS2S: A German-English Speech-to-Speech Translation Corpus [12.376309678270275]
We create the first publicly available speech-to-speech training corpus between German and English.
This allows the creation of a new text-to-speech and speech-to-speech translation model.
We propose text-to-speech models based on the recently proposed FastSpeech 2 model.
arXiv Detail & Related papers (2022-04-22T09:33:31Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- UWSpeech: Speech to Speech Translation for Unwritten Languages [145.37116196042282]
We develop a translation system for unwritten languages, named UWSpeech, which converts target unwritten speech into discrete tokens with a converter.
We propose a method called XL-VAE, which enhances vector quantized variational autoencoder (VQ-VAE) with cross-lingual (XL) speech recognition.
Experiments on Fisher Spanish-English conversation translation dataset show that UWSpeech outperforms direct translation and VQ-VAE baseline by about 16 and 10 BLEU points respectively.
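The vector-quantization step that XL-VAE inherits from VQ-VAE can be sketched as follows: each continuous encoder frame is replaced by the nearest entry in a learned codebook, yielding the discrete speech tokens the translation model operates on. The codebook values below are random placeholders rather than trained weights, and `quantize` is an illustrative name, not an API from either paper.

```python
import numpy as np

# Minimal sketch of VQ-VAE's quantization step: map continuous encoder
# frames to discrete token ids via nearest-neighbor codebook lookup.
rng = np.random.default_rng(1)
codebook = rng.standard_normal((256, 64))  # 256 discrete tokens, dim 64

def quantize(frames: np.ndarray):
    """Return (token_ids, quantized_frames) for encoder output frames."""
    # Squared L2 distance from every frame to every codebook entry:
    # shapes (50,1,64) - (1,256,64) broadcast to (50,256,64), sum -> (50,256).
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d.argmin(axis=1)       # nearest code per frame
    return ids, codebook[ids]    # ids and their codebook vectors

frames = rng.standard_normal((50, 64))  # e.g. 50 encoder output frames
ids, quantized = quantize(frames)
```

XL-VAE's contribution, per the summary above, is training this quantization jointly with cross-lingual speech recognition so the resulting tokens carry phonetic content that transfers across languages.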
arXiv Detail & Related papers (2020-06-14T15:22:12Z)
- Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario [10.779568857641928]
This paper presents an extension on Tacotron2 to achieve bilingual multispeaker speech synthesis.
We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers.
arXiv Detail & Related papers (2020-05-21T03:03:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.