TranSentence: Speech-to-speech Translation via Language-agnostic
Sentence-level Speech Encoding without Language-parallel Data
- URL: http://arxiv.org/abs/2401.12992v1
- Date: Wed, 17 Jan 2024 11:52:40 GMT
- Authors: Seung-Bin Kim, Sang-Hoon Lee, Seong-Whan Lee
- Abstract summary: TranSentence is a novel speech-to-speech translation method that requires no language-parallel speech data.
We train our model to generate speech based on the encoded embedding obtained from a language-agnostic sentence-level speech encoder.
We extend TranSentence to multilingual speech-to-speech translation.
- Score: 44.83532231917504
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Although there has been significant advancement in the field of
speech-to-speech translation, conventional models still require
language-parallel speech data between the source and target languages for
training. In this paper, we introduce TranSentence, a novel speech-to-speech
translation method that requires no language-parallel speech data. To achieve this, we first
adopt a language-agnostic sentence-level speech encoding that captures the
semantic information of speech, irrespective of language. We then train our
model to generate speech based on the encoded embedding obtained from a
language-agnostic sentence-level speech encoder that is pre-trained with
various languages. With this method, despite training exclusively on the target
language's monolingual data, we can generate target language speech in the
inference stage using language-agnostic speech embedding from the source
language speech. Furthermore, we extend TranSentence to multilingual
speech-to-speech translation. The experimental results demonstrate that
TranSentence is superior to other models.
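The training/inference asymmetry described in the abstract can be sketched as follows. This is an illustrative toy, not the paper's implementation: the language-agnostic sentence-level encoder is simulated by a fixed random projection (`PROJ`, `encode_sentence` are placeholder names) so the script runs standalone; in TranSentence this role is played by a neural encoder pre-trained on many languages, and a speech decoder (not shown) is trained on target-language monolingual data only.

```python
import numpy as np

# Stand-in for a pre-trained language-agnostic sentence-level speech
# encoder. A fixed random projection keeps the script runnable; the
# real encoder is a multilingual neural network.
rng = np.random.default_rng(0)
PROJ = rng.standard_normal((80, 512))

def encode_sentence(mel_frames: np.ndarray) -> np.ndarray:
    """Map variable-length speech (frames x 80 mel bins) to one fixed
    sentence-level embedding, irrespective of the input language."""
    pooled = mel_frames.mean(axis=0)   # temporal pooling -> (80,)
    emb = pooled @ PROJ                # project -> (512,)
    return emb / (np.linalg.norm(emb) + 1e-8)

# Training: only target-language monolingual speech is used; a decoder
# would learn to reconstruct it from its own sentence embedding.
target_speech = rng.standard_normal((120, 80))
train_embedding = encode_sentence(target_speech)

# Inference: source-language speech passes through the SAME frozen
# encoder, and the decoder generates target-language speech from the
# resulting embedding.
source_speech = rng.standard_normal((95, 80))
infer_embedding = encode_sentence(source_speech)
```

Because both languages land in the same embedding space, the decoder never needs to see source/target speech pairs during training.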
Related papers
- Cross-Lingual Transfer Learning for Speech Translation [7.802021866251242]
Zero-shot cross-lingual transfer has been demonstrated on a range of NLP tasks.
We explore whether speech-based models exhibit the same transfer capability.
arXiv Detail & Related papers (2024-07-01T09:51:48Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling [92.55131711064935]
We propose a cross-lingual neural language model, VALL-E X, for cross-lingual speech synthesis.
VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.
It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z)
- Code-Switching without Switching: Language Agnostic End-to-End Speech Translation [68.8204255655161]
We treat speech recognition and translation as one unified end-to-end speech translation problem.
By training LAST with both input languages, we decode speech into one target language, regardless of the input language.
arXiv Detail & Related papers (2022-10-04T10:34:25Z)
- LibriS2S: A German-English Speech-to-Speech Translation Corpus [12.376309678270275]
We create the first publicly available speech-to-speech training corpus between German and English.
This allows the creation of a new text-to-speech and speech-to-speech translation model.
We propose text-to-speech models based on the recently proposed FastSpeech 2 model.
arXiv Detail & Related papers (2022-04-22T09:33:31Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- UWSpeech: Speech to Speech Translation for Unwritten Languages [145.37116196042282]
We develop a translation system for unwritten languages, named UWSpeech, which converts target unwritten speech into discrete tokens with a converter.
We propose a method called XL-VAE, which enhances vector quantized variational autoencoder (VQ-VAE) with cross-lingual (XL) speech recognition.
Experiments on Fisher Spanish-English conversation translation dataset show that UWSpeech outperforms direct translation and VQ-VAE baseline by about 16 and 10 BLEU points respectively.
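The vector-quantization step that XL-VAE inherits from VQ-VAE can be sketched as follows: each continuous encoder frame is replaced by the nearest entry in a learned codebook, yielding the discrete speech tokens the translation model operates on. The codebook values below are random placeholders rather than trained weights, and `quantize` is an illustrative name, not an API from either paper.

```python
import numpy as np

# Minimal sketch of VQ-VAE's quantization step: map continuous encoder
# frames to discrete token ids via nearest-neighbor codebook lookup.
rng = np.random.default_rng(1)
codebook = rng.standard_normal((256, 64))  # 256 discrete tokens, dim 64

def quantize(frames: np.ndarray):
    """Return (token_ids, quantized_frames) for encoder output frames."""
    # Squared L2 distance from every frame to every codebook entry:
    # shapes (50,1,64) - (1,256,64) broadcast to (50,256,64), sum -> (50,256).
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d.argmin(axis=1)       # nearest code per frame
    return ids, codebook[ids]    # ids and their codebook vectors

frames = rng.standard_normal((50, 64))  # e.g. 50 encoder output frames
ids, quantized = quantize(frames)
```

XL-VAE's contribution, per the summary above, is training this quantization jointly with cross-lingual speech recognition so the resulting tokens carry phonetic content that transfers across languages.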
arXiv Detail & Related papers (2020-06-14T15:22:12Z)
- Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario [10.779568857641928]
This paper presents an extension on Tacotron2 to achieve bilingual multispeaker speech synthesis.
We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers.
arXiv Detail & Related papers (2020-05-21T03:03:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.