Direct Text to Speech Translation System using Acoustic Units
- URL: http://arxiv.org/abs/2309.07478v1
- Date: Thu, 14 Sep 2023 07:35:14 GMT
- Title: Direct Text to Speech Translation System using Acoustic Units
- Authors: Victoria Mingote, Pablo Gimeno, Luis Vicente, Sameer Khurana, Antoine
Laurent, Jarod Duret
- Abstract summary: This paper proposes a direct text-to-speech translation system using discrete acoustic units.
The framework takes text in different source languages as input and generates speech in the target language without requiring text transcriptions in that language.
Results show a remarkable improvement when the proposed architecture is initialised with a model pre-trained on more languages.
- Score: 12.36988942647101
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper proposes a direct text-to-speech translation system using
discrete acoustic units. This framework employs text in different source
languages as input to generate speech in the target language without the need
for text transcriptions in that language. Motivated by the success of acoustic
units in previous work on direct speech-to-speech translation systems, we use
the same pipeline to extract the acoustic units: a speech encoder combined with
a clustering algorithm. Once the units are obtained, an encoder-decoder
architecture is trained to predict them, and a vocoder then generates speech
from the units. Our approach to direct text-to-speech translation was tested on
the new CVSS corpus with two different text mBART models employed as
initialisation. The systems presented report competitive performance for most
of the language pairs evaluated. Moreover, results show a remarkable
improvement when the proposed architecture is initialised with a model
pre-trained on more languages.
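The pipeline the abstract describes (a speech encoder plus clustering to define
the units, an encoder-decoder that predicts units from source text, and a unit
vocoder) can be sketched compactly. The sketch below is illustrative, not the
authors' code: it assumes a HuBERT-style speech encoder and k-means clustering,
the common recipe in the unit-based translation work this paper builds on, and
stands in toy modules for the mBART-initialised seq2seq model and the vocoder.

```python
# Illustrative sketch of the text -> acoustic units -> speech pipeline above.
# Assumptions (not from the paper's code): a HuBERT-style encoder supplies the
# frame-level features, sklearn k-means defines the unit inventory, and the
# seq2seq model and vocoder are toy stand-ins.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

N_UNITS = 1000  # size of the discrete acoustic-unit vocabulary (assumed)

# 1) Offline: learn the unit inventory by clustering speech-encoder features.
def learn_units(features: torch.Tensor) -> KMeans:
    """features: (num_frames, dim) hidden states from a pretrained speech encoder."""
    return KMeans(n_clusters=N_UNITS).fit(features.numpy())

def speech_to_units(features: torch.Tensor, km: KMeans) -> torch.Tensor:
    """Quantise each frame to its nearest cluster id -> discrete acoustic units."""
    return torch.as_tensor(km.predict(features.numpy()))

# 2) Training: an encoder-decoder maps source-language text to target-side units.
class TextToUnit(nn.Module):
    def __init__(self, text_vocab: int, d: int = 512):
        super().__init__()
        self.src_emb = nn.Embedding(text_vocab, d)
        self.unit_emb = nn.Embedding(N_UNITS, d)
        self.seq2seq = nn.Transformer(d_model=d, batch_first=True)
        self.out = nn.Linear(d, N_UNITS)

    def forward(self, text_ids, unit_ids):
        # Teacher forcing: predict the target unit sequence given source text.
        h = self.seq2seq(self.src_emb(text_ids), self.unit_emb(unit_ids))
        return self.out(h)  # (batch, tgt_len, N_UNITS) logits over units

# 3) Inference: predicted units are rendered to a waveform by a unit vocoder
# (a unit-HiFi-GAN in prior unit-based S2ST work); stubbed out here.
def vocoder(units: torch.Tensor) -> torch.Tensor:
    raise NotImplementedError("stand-in for a pretrained unit-to-waveform vocoder")
```

In the paper itself, the encoder-decoder is initialised from a pretrained text
mBART, which is where the reported gains from multilingual pre-training enter.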
Related papers
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
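The units-as-pseudo-text idea in the UTUT entry above is easy to make concrete.
A minimal sketch follows; the `<unit_i>` token format is an assumption chosen
for illustration, not UTUT's actual convention.

```python
# Toy illustration of treating discrete speech units as pseudo-text, so a
# standard text seq2seq pipeline can consume them. The "<unit_i>" token
# format is an assumed convention for illustration, not UTUT's own.
def units_to_pseudo_text(units: list[int]) -> str:
    return " ".join(f"<unit_{u}>" for u in units)

def pseudo_text_to_units(text: str) -> list[int]:
    return [int(tok[6:-1]) for tok in text.split()]

# Round trip: unit ids survive the pseudo-text encoding unchanged.
assert pseudo_text_to_units(units_to_pseudo_text([5, 42, 42, 7])) == [5, 42, 42, 7]
```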
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
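A common shape for "incorporating acoustic information into a text-based LLM",
as the Speech-LLaMA entry above describes, is to project speech-encoder states
into the LLM embedding space and prepend them as a soft prefix. The sketch
below shows that generic pattern; all module names and dimensions here are
assumptions, not Speech-LLaMA internals.

```python
# Generic decoder-only integration pattern: project speech features into the
# text LLM's embedding space and prepend them as a soft prefix. The modules
# and dimensions are illustrative assumptions, not Speech-LLaMA's internals.
import torch
import torch.nn as nn

class AudioPrefixLM(nn.Module):
    def __init__(self, llm_embed: nn.Embedding, llm_body: nn.Module, audio_dim: int):
        super().__init__()
        self.project = nn.Linear(audio_dim, llm_embed.embedding_dim)
        self.embed, self.body = llm_embed, llm_body

    def forward(self, audio_feats: torch.Tensor, text_ids: torch.Tensor):
        # audio_feats: (B, T_audio, audio_dim) from a (typically frozen) speech encoder
        prefix = self.project(audio_feats)   # map into the LLM embedding space
        tokens = self.embed(text_ids)        # (B, T_text, d)
        return self.body(torch.cat([prefix, tokens], dim=1))
```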
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
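AudioPaLM's fusion of text-based and speech-based models rests on a single
token space covering both modalities. A minimal sketch of such a joint
vocabulary follows; the sizes are assumed for illustration.

```python
# Minimal sketch of a joint text+audio token space of the kind AudioPaLM uses:
# audio unit ids are offset past the text vocabulary so one decoder-only model
# can read and emit both modalities. Vocabulary sizes are assumed.
TEXT_VOCAB = 32_000
AUDIO_UNITS = 1_024

def audio_token(unit_id: int) -> int:
    return TEXT_VOCAB + unit_id            # audio tokens live after text tokens

def is_audio_token(tok: int) -> bool:
    return tok >= TEXT_VOCAB

JOINT_VOCAB = TEXT_VOCAB + AUDIO_UNITS     # one embedding table covers both
```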
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
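The SpeechUT entry above describes wiring a speech encoder and a text decoder
through a shared unit encoder. The sketch below keeps only that wiring; every
module is a stand-in, not the actual SpeechUT implementation.

```python
# Structural sketch of the SpeechUT idea: speech and text are both mapped into
# a shared hidden-unit space, so one unit encoder feeds one text decoder.
# Every module below is a stand-in; this shows the wiring, not the method.
import torch.nn as nn

class SpeechUTSketch(nn.Module):
    def __init__(self, speech_enc, text_to_unit, unit_enc, text_dec):
        super().__init__()
        self.speech_enc = speech_enc      # speech -> unit-space representations
        self.text_to_unit = text_to_unit  # text   -> unit-space representations
        self.unit_enc = unit_enc          # shared unit encoder
        self.text_dec = text_dec          # shared text decoder

    def forward(self, x, modality: str):
        units = self.speech_enc(x) if modality == "speech" else self.text_to_unit(x)
        return self.text_dec(self.unit_enc(units))
```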
- Direct simultaneous speech to speech translation [29.958601064888132]
We present the first direct simultaneous speech-to-speech translation (Simul-S2ST) model.
The model can start generating translation in the target speech before consuming the full source speech content.
arXiv Detail & Related papers (2021-10-15T17:59:15Z)
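Generating target speech "before consuming the full source" requires an
explicit read/write policy. The sketch below shows wait-k, the textbook
simultaneous-translation policy, purely as an illustration; the paper above
does not necessarily use this exact policy, and `decode_step` is a
hypothetical callable.

```python
# Generic wait-k read/write loop for simultaneous translation: after an
# initial wait of k source chunks, emit one target step per new chunk read.
# Illustrative only; not necessarily the policy of the Simul-S2ST paper above.
def wait_k_policy(source_chunks, decode_step, k: int = 3):
    outputs, read = [], []
    for chunk in source_chunks:
        read.append(chunk)                               # READ one source chunk
        if len(read) >= k:
            outputs.append(decode_step(read, outputs))   # WRITE one target step
    # Input exhausted: keep writing until decode_step signals completion (None).
    while (step := decode_step(read, outputs)) is not None:
        outputs.append(step)
    return outputs
```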
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
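The "dual mode output" in the entry above amounts to one shared encoder feeding
a unit decoder and a text decoder trained jointly. The sketch below keeps only
that structure; the weighted joint loss is an assumed training objective, not
the paper's exact recipe.

```python
# Structural sketch of joint speech+text multitask training for direct S2ST:
# one shared encoder, a unit decoder for target speech units, and a text
# decoder for target transcripts. The loss weighting is an assumption.
import torch.nn as nn

class DualModeS2ST(nn.Module):
    def __init__(self, encoder, unit_decoder, text_decoder):
        super().__init__()
        self.encoder, self.unit_dec, self.text_dec = encoder, unit_decoder, text_decoder

    def forward(self, src_speech):
        h = self.encoder(src_speech)
        return self.unit_dec(h), self.text_dec(h)  # speech units + text, one pass

def joint_loss(unit_logits, text_logits, unit_tgt, text_tgt, alpha=0.5):
    ce = nn.CrossEntropyLoss()
    return (ce(unit_logits.flatten(0, 1), unit_tgt.flatten())
            + alpha * ce(text_logits.flatten(0, 1), text_tgt.flatten()))
```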
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
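"Bridging the modality gap" is typically operationalised by mapping frame-rate
speech representations toward token-rate text representations before the shared
layers. The adapter below is one generic way to do that, sketched under assumed
dimensions; it is not the specific mechanism proposed in the paper above.

```python
# One generic way to narrow the speech/text modality gap: pool frame-level
# speech features toward roughly token-level granularity and project them
# into the text embedding space. Illustrative adapter; stride and dimensions
# are assumptions, not the paper's mechanism.
import torch.nn as nn

class ModalityAdapter(nn.Module):
    def __init__(self, speech_dim: int, text_dim: int, stride: int = 4):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)  # shorten sequence
        self.proj = nn.Linear(speech_dim, text_dim)                  # match text space

    def forward(self, speech_feats):  # (B, T, speech_dim)
        x = self.pool(speech_feats.transpose(1, 2)).transpose(1, 2)  # (B, T/stride, D)
        return self.proj(x)           # (B, T/stride, text_dim)
```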