Fluent and Low-latency Simultaneous Speech-to-Speech Translation with
Self-adaptive Training
- URL: http://arxiv.org/abs/2010.10048v2
- Date: Wed, 21 Oct 2020 19:12:17 GMT
- Title: Fluent and Low-latency Simultaneous Speech-to-Speech Translation with
Self-adaptive Training
- Authors: Renjie Zheng, Mingbo Ma, Baigong Zheng, Kaibo Liu, Jiahong Yuan,
Kenneth Church, Liang Huang
- Abstract summary: Simultaneous speech-to-speech translation is widely useful but extremely challenging.
It needs to generate target-language speech concurrently with the source-language speech, with only a few seconds delay.
Current approaches accumulate latencies progressively when the speaker talks faster, and introduce unnatural pauses when the speaker talks slower.
We propose Self-Adaptive Translation (SAT), which flexibly adjusts the length of translations to accommodate different source speech rates.
- Score: 40.71155396456831
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Simultaneous speech-to-speech translation is widely useful but extremely
challenging, since it needs to generate target-language speech concurrently
with the source-language speech, with only a few seconds delay. In addition, it
needs to continuously translate a stream of sentences, but all recent solutions
merely focus on the single-sentence scenario. As a result, current approaches
accumulate latencies progressively when the speaker talks faster, and introduce
unnatural pauses when the speaker talks slower. To overcome these issues, we
propose Self-Adaptive Translation (SAT) which flexibly adjusts the length of
translations to accommodate different source speech rates. At similar levels of
translation quality (as measured by BLEU), our method generates more fluent
target speech (as measured by the naturalness metric MOS) with substantially
lower latency than the baseline, in both Zh <-> En directions.
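To make the length-adaptation idea concrete, below is a minimal, hypothetical Python sketch: the target-length budget is derived from how much time the source speech actually consumed, so fast speech yields shorter translations and slow speech yields longer, pause-free ones. The function name, the linear scaling rule, and all constants are illustrative assumptions; SAT learns this behavior through self-adaptive training rather than applying a hand-written rule.

```python
def target_length_budget(src_tokens: int,
                         src_duration_s: float,
                         tgt_speech_rate: float = 3.5,
                         ratio_bounds: tuple = (0.7, 1.3)) -> int:
    """Estimate how many target tokens fit the time the source occupied.

    src_tokens      -- number of source tokens heard so far
    src_duration_s  -- wall-clock seconds the source speech consumed
    tgt_speech_rate -- assumed TTS speaking rate in tokens per second
    ratio_bounds    -- clamp on the target/source length ratio
    """
    # Tokens the target TTS can utter in the time the source consumed.
    budget = tgt_speech_rate * src_duration_s
    # Keep the translation within a plausible length ratio of the source
    # so timing is never bought at an extreme cost to adequacy (BLEU).
    lo, hi = ratio_bounds
    return int(min(max(budget, lo * src_tokens), hi * src_tokens))

# A fast speaker (20 tokens in 4 s) gets a short budget; a slow speaker
# (20 tokens in 8 s) gets a longer one.
print(target_length_budget(20, 4.0))  # 14
print(target_length_budget(20, 8.0))  # 26
```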
Related papers
- StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning [48.84039953531356]
StreamSpeech is a direct Simul-S2ST model that jointly learns translation and simultaneous policy.
Experiments on the CVSS benchmark demonstrate that StreamSpeech achieves state-of-the-art performance in both offline S2ST and Simul-S2ST tasks.
arXiv Detail & Related papers (2024-06-05T08:24:22Z)
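As a rough illustration of the multi-task setup in the StreamSpeech entry above, this sketch combines per-task losses that would share one speech encoder. The task names and loss weights are assumptions for illustration, not StreamSpeech's actual heads or hyperparameters.

```python
import torch

def multitask_loss(losses: dict, weights: dict) -> torch.Tensor:
    """Weighted sum of per-task losses sharing one speech encoder."""
    return sum(weights[name] * loss for name, loss in losses.items())

losses = {
    "asr": torch.tensor(1.2),     # source transcription head
    "s2tt": torch.tensor(0.9),    # speech-to-text translation head
    "unit": torch.tensor(2.1),    # target speech-unit generation head
    "policy": torch.tensor(0.3),  # learned read/write (when-to-speak) head
}
weights = {"asr": 1.0, "s2tt": 1.0, "unit": 1.0, "policy": 0.5}
print(multitask_loss(losses, weights))  # tensor(4.3500)
```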
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
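As a rough illustration of the two-encoder conditioning described in the TransVIP entry above, the sketch below pools a voice embedding and a timing embedding that would jointly condition a decoder. Layer sizes, pooling, and the concatenation scheme are illustrative assumptions, not TransVIP's architecture.

```python
import torch
import torch.nn as nn

class DualConditioner(nn.Module):
    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        # Voice encoder: pools source mel frames into a speaker embedding.
        self.voice_enc = nn.Sequential(nn.Linear(n_mels, dim), nn.ReLU())
        # Isochrony encoder: summarizes per-frame timing information
        # (e.g., voiced/paused flags) into a timing embedding.
        self.timing_enc = nn.Sequential(nn.Linear(1, dim), nn.ReLU())

    def forward(self, mels: torch.Tensor, voiced: torch.Tensor) -> torch.Tensor:
        v = self.voice_enc(mels).mean(dim=1)     # (batch, dim)
        t = self.timing_enc(voiced).mean(dim=1)  # (batch, dim)
        # Both embeddings would condition the translation decoder.
        return torch.cat([v, t], dim=-1)         # (batch, 2 * dim)

cond = DualConditioner()
mels = torch.randn(2, 120, 80)   # batch of mel-spectrogram frames
voiced = torch.rand(2, 120, 1)   # per-frame voicing flags
print(cond(mels, voiced).shape)  # torch.Size([2, 512])
```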
- TranSentence: Speech-to-speech Translation via Language-agnostic Sentence-level Speech Encoding without Language-parallel Data [44.83532231917504]
TranSentence is a novel speech-to-speech translation approach that requires no language-parallel speech data.
We train our model to generate speech based on the encoded embedding obtained from a language-agnostic sentence-level speech encoder.
We extend TranSentence to multilingual speech-to-speech translation.
arXiv Detail & Related papers (2024-01-17T11:52:40Z)
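A hedged sketch of the TranSentence-style pipeline above: source speech is mapped to a language-agnostic sentence embedding, and a generator trained only on target-language data synthesizes speech from it. Both components below are stand-in stubs, not the paper's learned models.

```python
import numpy as np

def sentence_encoder(speech: np.ndarray) -> np.ndarray:
    """Stand-in for a language-agnostic sentence-level speech encoder."""
    vec = speech.mean(axis=0)                  # crude pooling placeholder
    return vec / (np.linalg.norm(vec) + 1e-8)  # unit-normalized embedding

def speech_generator(embedding: np.ndarray, n_frames: int = 100) -> np.ndarray:
    """Stand-in for a decoder trained to reconstruct target speech."""
    return np.tile(embedding, (n_frames, 1))   # placeholder synthesis

# Because the embedding space is shared across languages, a generator
# trained only on target-language (embedding, speech) pairs can accept
# an embedding computed from source-language speech at test time.
src_speech = np.random.randn(200, 80)          # source-language features
tgt_speech = speech_generator(sentence_encoder(src_speech))
print(tgt_speech.shape)                        # (100, 80)
```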
- Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing [71.02335065794384]
We propose a model that directly optimizes both the translation and the speech duration of the generated translations.
We show that this system generates speech that better matches the timing of the original speech, compared to prior work, while simplifying the system architecture.
arXiv Detail & Related papers (2023-02-25T04:23:25Z)
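To illustrate the quality-versus-timing trade-off in the dubbing entry above, the sketch below rescores candidate translations by combining an MT log-probability with a duration-mismatch penalty. The candidates, durations, and weight `alpha` are made-up values, and the paper optimizes translation and duration jointly in one model rather than by rescoring.

```python
def isochrony_score(logprob: float, pred_dur: float, src_dur: float,
                    alpha: float = 2.0) -> float:
    """Higher is better: MT score minus a duration-mismatch penalty."""
    return logprob - alpha * abs(pred_dur - src_dur) / src_dur

candidates = [
    ("short paraphrase", -1.2, 2.1),   # (text, logprob, predicted seconds)
    ("literal rendering", -0.8, 4.4),
    ("padded rendering", -1.5, 2.9),
]
src_duration = 2.8  # seconds the original line occupies on screen

best = max(candidates, key=lambda c: isochrony_score(c[1], c[2], src_duration))
print(best[0])  # "padded rendering": best balance of MT score and timing
```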
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech significantly improves inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
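A rough sketch of the mask-predict style non-autoregressive decoding named in the TranSpeech entry above: all target units are emitted at once, then the least-confident ones are iteratively re-masked and re-predicted. The `model_predict` stub and the masking schedule are illustrative assumptions, not the paper's learned unit decoder.

```python
import random

def model_predict(units, masked_positions):
    """Stand-in: fill masked slots with random units and confidences."""
    out = list(units)
    conf = [1.0] * len(units)
    for i in masked_positions:
        out[i] = random.randrange(100)  # pretend vocabulary of 100 units
        conf[i] = random.random()       # pretend model confidence
    return out, conf

def mask_predict(length: int = 10, iterations: int = 4):
    # First pass: predict every position in parallel.
    units, conf = model_predict([None] * length, range(length))
    for t in range(1, iterations):
        # Linearly decay how many low-confidence units get re-masked.
        n_mask = length * (iterations - t) // iterations
        remask = sorted(range(length), key=lambda i: conf[i])[:n_mask]
        units, conf = model_predict(units, remask)
    return units

print(mask_predict())  # e.g. [17, 3, 88, ...] after 4 refinement passes
```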
- Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram [21.652906261475533]
Cross-lingual voice conversion is a challenging problem due to significant mismatches of the phonetic set and the speech prosody of different languages.
We build upon a neural text-to-speech (TTS) model to design a new cross-lingual VC framework named FastSpeech-VC.
arXiv Detail & Related papers (2021-02-03T10:28:07Z)
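To illustrate the phonetic posteriorgram (PPG) idea in the entry above: a speaker-independent recognizer turns each speech frame into a distribution over phonetic classes, which a TTS-style decoder could then re-synthesize in a target speaker's voice. The extractor below is a stand-in stub, not FastSpeech-VC's acoustic model.

```python
import numpy as np

def extract_ppg(frames: np.ndarray, n_phones: int = 40) -> np.ndarray:
    """Stand-in acoustic model: frame-wise phonetic class posteriors."""
    logits = frames @ np.random.randn(frames.shape[1], n_phones)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # each row sums to 1

src_frames = np.random.randn(150, 80)        # source-language speech
ppg = extract_ppg(src_frames)
# The PPG carries "what was said" but (ideally) not "who said it", so a
# decoder conditioned on a target speaker embedding can speak it in the
# target voice, even across languages.
print(ppg.shape, round(float(ppg[0].sum()), 3))  # (150, 40) 1.0
```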
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model that aims to improve end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
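As a loose illustration of the speech-to-text adaptation idea in the entry above, the sketch below trains a small adapter to pull pooled speech representations toward their paired text representations. The adapter shape and the L2 objective are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn as nn

adapter = nn.Linear(512, 512)      # maps speech space toward text space
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)

speech_repr = torch.randn(8, 512)  # pooled speech-encoder states
text_repr = torch.randn(8, 512)    # paired text-encoder states

for step in range(100):
    opt.zero_grad()
    # Pull adapted speech representations toward their paired text
    # representations so the downstream decoder sees familiar input.
    loss = ((adapter(speech_repr) - text_repr) ** 2).mean()
    loss.backward()
    opt.step()

print(f"final adaptation loss: {loss.item():.4f}")
```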
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.