Enhancing Speech-to-Speech Translation with Multiple TTS Targets
- URL: http://arxiv.org/abs/2304.04618v1
- Date: Mon, 10 Apr 2023 14:33:33 GMT
- Title: Enhancing Speech-to-Speech Translation with Multiple TTS Targets
- Authors: Jiatong Shi, Yun Tang, Ann Lee, Hirofumi Inaguma, Changhan Wang, Juan
Pino, Shinji Watanabe
- Abstract summary: We analyze the effect of changing synthesized target speech for direct S2ST models.
We propose a multi-task framework that jointly optimizes the S2ST system with multiple targets from different TTS systems.
- Score: 62.18395387305803
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It has been known that direct speech-to-speech translation (S2ST) models
usually suffer from the data scarcity issue because of the limited existing
parallel materials for both source and target speech. Therefore, to train a
direct S2ST system, previous works usually utilize text-to-speech (TTS) systems
to generate samples in the target language by augmenting the data from
speech-to-text translation (S2TT). However, there is a limited investigation
into how the synthesized target speech would affect the S2ST models. In this
work, we analyze the effect of changing synthesized target speech for direct
S2ST models. We find that simply combining the target speech from different TTS
systems can potentially improve S2ST performance. Following that, we also
propose a multi-task framework that jointly optimizes the S2ST system with
multiple targets from different TTS systems. Extensive experiments demonstrate
that our proposed framework achieves consistent improvements (2.8 BLEU) over
the baselines on the Fisher Spanish-English dataset.
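To make the abstract's proposal more concrete, the sketch below jointly optimizes a single shared speech encoder against target units derived from several TTS systems, with one loss term per TTS target. The class and function names, the per-target projection heads, and the equal loss weights are illustrative assumptions, not the paper's actual architecture or training recipe.

```python
# Minimal sketch (PyTorch) of jointly optimizing one S2ST model against target
# speech units synthesized by several different TTS systems. Architecture and
# loss weighting are illustrative assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn


class MultiTargetS2ST(nn.Module):
    def __init__(self, num_tts_targets: int, d_model: int = 256, unit_vocab: int = 1000):
        super().__init__()
        # Shared encoder over source-language speech features.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # One projection head per TTS system's target-unit inventory.
        # (A real system would use attention-based decoders; this sketch
        # assumes encoder frames are already aligned to the target units.)
        self.heads = nn.ModuleList(
            nn.Linear(d_model, unit_vocab) for _ in range(num_tts_targets)
        )

    def forward(self, src_feats):
        enc = self.encoder(src_feats)               # (B, T, d_model)
        return [head(enc) for head in self.heads]   # per-target logits (B, T, V)


def multi_target_loss(logits_per_target, units_per_target, weights=None):
    """Weighted sum of cross-entropy losses, one per TTS-generated target."""
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    weights = weights or [1.0] * len(logits_per_target)
    loss = 0.0
    for w, logits, units in zip(weights, logits_per_target, units_per_target):
        loss = loss + w * ce(logits.transpose(1, 2), units)  # (B, V, T) vs (B, T)
    return loss
```

With this interface, a training step runs the model once per batch, obtains one set of logits per TTS target, and back-propagates the combined `multi_target_loss`.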
Related papers
- Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS).
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
arXiv Detail & Related papers (2024-06-11T14:17:12Z)
- Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from the data scarcity problem because parallel corpora pairing source-language speech with target-language speech are very rare.
We propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks.
arXiv Detail & Related papers (2022-10-31T02:55:51Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices; a generic mask-predict decoding loop is sketched after this list.
TranSpeech shows a significant improvement in inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Incremental Speech Synthesis For Speech-To-Speech Translation [23.951060578077445]
We focus on improving the incremental synthesis performance of TTS models.
With a simple data augmentation strategy based on prefixes, we are able to improve incremental TTS quality to approach offline performance; a toy prefix-augmentation example is sketched after this list.
We propose latency metrics tailored to S2ST applications, and investigate methods for latency reduction in this context.
arXiv Detail & Related papers (2021-10-15T17:20:28Z)
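The TranSpeech entry above describes a decoder that repeatedly masks and predicts unit choices. The loop below is a generic mask-predict style refinement over discrete speech units, written as a sketch: the `model(src, units)` interface, the linear re-masking schedule, and the externally supplied target length are assumptions for illustration, not TranSpeech's actual implementation.

```python
# Generic mask-predict (CMLM-style) refinement over discrete speech units.
# The model interface and re-masking schedule are assumptions for illustration.
import torch


@torch.no_grad()
def mask_predict_decode(model, src, tgt_len, mask_id, num_iters=4):
    """Iteratively refine a fully masked unit sequence.

    `model(src, units)` is assumed to return per-position logits of shape
    (tgt_len, vocab); this interface is hypothetical.
    """
    units = torch.full((tgt_len,), mask_id, dtype=torch.long)
    probs = torch.zeros(tgt_len)
    for it in range(num_iters):
        logits = model(src, units)                      # (tgt_len, vocab)
        new_probs, new_units = logits.softmax(-1).max(-1)
        masked = units == mask_id
        units[masked] = new_units[masked]               # fill masked positions
        probs[masked] = new_probs[masked]
        # Linearly reduce how many low-confidence positions get re-masked.
        num_mask = int(tgt_len * (1 - (it + 1) / num_iters))
        if num_mask == 0:
            break
        remask = probs.topk(num_mask, largest=False).indices
        units[remask] = mask_id
    return units
```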
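The incremental-TTS entry mentions a simple prefix-based data augmentation strategy. The toy function below shows one assumed word-boundary variant: each training utterance is expanded into its word prefixes so the TTS model also sees partial inputs; how each prefix is paired with audio in the actual paper is not specified here.

```python
# Toy word-level prefix augmentation (an assumed variant): every utterance in
# the TTS training set is expanded into its word prefixes so the model also
# learns to synthesize partial inputs, as needed for incremental synthesis.
def prefix_augment(text: str, min_words: int = 2) -> list[str]:
    words = text.split()
    return [" ".join(words[:k]) for k in range(min_words, len(words) + 1)]


print(prefix_augment("the meeting starts at nine"))
# ['the meeting', 'the meeting starts', 'the meeting starts at',
#  'the meeting starts at nine']
```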