A Holistic Cascade System, benchmark, and Human Evaluation Protocol for
Expressive Speech-to-Speech Translation
- URL: http://arxiv.org/abs/2301.10606v1
- Date: Wed, 25 Jan 2023 14:27:00 GMT
- Title: A Holistic Cascade System, benchmark, and Human Evaluation Protocol for
Expressive Speech-to-Speech Translation
- Authors: Wen-Chin Huang, Benjamin Peloquin, Justine Kao, Changhan Wang, Hongyu
Gong, Elizabeth Salesky, Yossi Adi, Ann Lee, Peng-Jen Chen
- Abstract summary: Expressive speech-to-speech translation (S2ST) aims to transfer prosodic attributes of source speech to target speech while maintaining translation accuracy.
Existing research in expressive S2ST is limited, typically focusing on a single expressivity aspect at a time.
We propose a holistic cascade system for expressive S2ST, combining multiple prosody transfer techniques previously considered only in isolation.
- Score: 45.47457657122893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Expressive speech-to-speech translation (S2ST) aims to transfer prosodic
attributes of source speech to target speech while maintaining translation
accuracy. Existing research in expressive S2ST is limited, typically focusing
on a single expressivity aspect at a time. Likewise, this research area lacks
standard evaluation protocols and well-curated benchmark datasets. In this
work, we propose a holistic cascade system for expressive S2ST, combining
multiple prosody transfer techniques previously considered only in isolation.
We curate a benchmark expressivity test set in the TV series domain and
explore a second dataset in the audiobook domain. Finally, we present a human
evaluation protocol to assess multiple expressive dimensions across speech
pairs. Experimental results indicate that bi-lingual annotators can assess the
quality of expressive preservation in S2ST systems, and the holistic modeling
approach outperforms single-aspect systems. Audio samples can be accessed
through our demo webpage:
https://facebookresearch.github.io/speech_translation/cascade_expressive_s2st.
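As a rough illustration of the cascade approach described in the abstract, the sketch below wires placeholder ASR, MT, and expressive TTS stages together, transferring several prosodic attributes jointly rather than one aspect at a time. All function names (`asr`, `mt`, `expressive_tts`) and the specific prosodic features are hypothetical stand-ins, not the paper's actual components.

```python
def asr(source_audio):
    """Placeholder recognizer: returns a transcript plus prosodic features
    extracted from the source speech (all fields are illustrative)."""
    return {
        "text": source_audio["text"],
        "speech_rate": source_audio["speech_rate"],    # syllables per second
        "mean_pitch_hz": source_audio["mean_pitch_hz"],
        "pauses": source_audio["pauses"],              # pause offsets in seconds
    }

def mt(transcript):
    """Placeholder machine translation: maps source text to target text."""
    return {"hello world": "hallo welt"}.get(transcript, transcript)

def expressive_tts(text, speech_rate, mean_pitch_hz, pauses):
    """Placeholder synthesizer: a real system would generate target speech
    conditioned on these prosody controls; here we just echo the controls."""
    return {"text": text, "speech_rate": speech_rate,
            "mean_pitch_hz": mean_pitch_hz, "pauses": pauses}

def cascade_s2st(source_audio):
    # 1. Recognize the source speech and extract prosodic attributes.
    rec = asr(source_audio)
    # 2. Translate the transcript.
    target_text = mt(rec["text"])
    # 3. Synthesize target speech, transferring multiple prosodic
    #    attributes at once (the "holistic" combination).
    return expressive_tts(target_text, rec["speech_rate"],
                          rec["mean_pitch_hz"], rec["pauses"])

source = {"text": "hello world", "speech_rate": 4.2,
          "mean_pitch_hz": 180.0, "pauses": [0.8]}
output = cascade_s2st(source)
```

The point of the sketch is only the data flow: each stage passes prosodic controls forward so the final synthesis step can condition on all of them together.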
Related papers
- Enhancing Speech-to-Speech Translation with Multiple TTS Targets [62.18395387305803]
We analyze the effect of changing synthesized target speech for direct S2ST models.
We propose a multi-task framework that jointly optimizes the S2ST system with multiple targets from different TTS systems.
arXiv Detail & Related papers (2023-04-10T14:33:33Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Textless Direct Speech-to-Speech Translation with Discrete Speech Representation [27.182170555234226]
We propose a novel model, Textless Translatotron, for training an end-to-end direct S2ST model without any textual supervision.
When a speech encoder pre-trained with unsupervised speech data is used for both models, the proposed model obtains translation quality nearly on par with Translatotron 2.
arXiv Detail & Related papers (2022-10-31T19:48:38Z)
- Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from data scarcity because parallel corpora pairing source-language speech with target-language speech are very rare.
We propose Speech2S, a model jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks.
arXiv Detail & Related papers (2022-10-31T02:55:51Z)
- Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks.
In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, achieving up to a 21.4x speedup over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.