Enhancing expressivity transfer in textless speech-to-speech translation
- URL: http://arxiv.org/abs/2310.07279v1
- Date: Wed, 11 Oct 2023 08:07:22 GMT
- Title: Enhancing expressivity transfer in textless speech-to-speech translation
- Authors: Jarod Duret (LIA), Benjamin O'Brien (LIA), Yannick Estève (LIA), Titouan Parcollet (CAM)
- Abstract summary: Existing state-of-the-art systems fall short when it comes to capturing and transferring expressivity accurately across different languages.
This study presents a novel method that operates at the discrete speech unit level and leverages multilingual emotion embeddings.
We demonstrate how these embeddings can be used to effectively predict the pitch and duration of speech units in the target language.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Textless speech-to-speech translation systems are rapidly advancing, thanks
to the integration of self-supervised learning techniques. However, existing
state-of-the-art systems fall short when it comes to capturing and transferring
expressivity accurately across different languages. Expressivity plays a vital
role in conveying emotions, nuances, and cultural subtleties, thereby enhancing
communication across diverse languages. To address this issue, this study
presents a novel method that operates at the discrete speech unit level and
leverages multilingual emotion embeddings to capture language-agnostic
information. Specifically, we demonstrate how these embeddings can be used to
effectively predict the pitch and duration of speech units in the target
language. Through objective and subjective experiments conducted on a
French-to-English translation task, our findings highlight the superior
expressivity transfer achieved by our approach compared to current
state-of-the-art systems.
Related papers
- Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation [23.757896930482342]
This work explores the selection process through a study of downstream tasks.
Units that perform well in resynthesis are not necessarily the ones that enhance translation efficacy.
arXiv Detail & Related papers (2024-07-08T08:53:26Z)
- Controlling Emotion in Text-to-Speech with Natural Language Prompts [29.013577423045255]
We propose a system conditioned on embeddings derived from an emotionally rich text that serves as a prompt.
A joint representation of speaker and prompt embeddings is integrated at several points within a transformer-based architecture.
Our approach is trained on merged emotional speech and text datasets and varies prompts in each training iteration to increase the generalization capabilities of the model.
arXiv Detail & Related papers (2024-06-10T15:58:42Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition [5.520654376217889]
CLARA minimizes reliance on labelled data, enhancing generalization across languages.
Our approach adeptly captures emotional nuances in speech, overcoming subjective assessment issues.
It adapts to low-resource languages, marking progress in multilingual speech representation learning.
arXiv Detail & Related papers (2023-10-18T09:31:56Z)
- TRAVID: An End-to-End Video Translation Framework [1.6131714685439382]
We present an end-to-end video translation system that not only translates spoken language but also synchronizes the translated speech with the lip movements of the speaker.
Our system focuses on translating educational lectures in various Indian languages, and it is designed to be effective even in low-resource system settings.
arXiv Detail & Related papers (2023-09-20T14:13:05Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data [0.0]
We propose a method for speech-to-speech emotion translation that operates at the level of discrete speech units.
We show that this embedding can be used to predict the pitch and duration of speech units in a target language.
We evaluate our approach on English and French speech signals and show that it outperforms a baseline method.
arXiv Detail & Related papers (2023-06-29T08:06:54Z)
- BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models [56.93604813379634]
Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels.
We propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels.
We highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.
arXiv Detail & Related papers (2023-06-02T12:54:38Z)
- Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge this digital divide.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Self-Supervised Representations Improve End-to-End Speech Translation [57.641761472372814]
We show that self-supervised pre-trained features can consistently improve the translation performance.
Cross-lingual transfer allows extension to a variety of languages with little or no tuning.
arXiv Detail & Related papers (2020-06-22T10:28:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.