Fluent and Low-latency Simultaneous Speech-to-Speech Translation with
Self-adaptive Training
- URL: http://arxiv.org/abs/2010.10048v2
- Date: Wed, 21 Oct 2020 19:12:17 GMT
- Title: Fluent and Low-latency Simultaneous Speech-to-Speech Translation with
Self-adaptive Training
- Authors: Renjie Zheng, Mingbo Ma, Baigong Zheng, Kaibo Liu, Jiahong Yuan,
Kenneth Church, Liang Huang
- Abstract summary: Simultaneous speech-to-speech translation is widely useful but extremely challenging.
It needs to generate target-language speech concurrently with the source-language speech, with only a few seconds delay.
Current approaches accumulate latencies progressively when the speaker talks faster, and introduce unnatural pauses when the speaker talks slower.
We propose Self-Adaptive Translation (SAT), which flexibly adjusts the length of translations to accommodate different source speech rates.
- Score: 40.71155396456831
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Simultaneous speech-to-speech translation is widely useful but extremely
challenging, since it needs to generate target-language speech concurrently
with the source-language speech, with only a few seconds delay. In addition, it
needs to continuously translate a stream of sentences, but all recent solutions
merely focus on the single-sentence scenario. As a result, current approaches
accumulate latencies progressively when the speaker talks faster, and introduce
unnatural pauses when the speaker talks slower. To overcome these issues, we
propose Self-Adaptive Translation (SAT) which flexibly adjusts the length of
translations to accommodate different source speech rates. At similar levels of
translation quality (as measured by BLEU), our method generates more fluent
target speech (as measured by the naturalness metric MOS) with substantially
lower latency than the baseline, in both Zh <-> En directions.
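To make the length-adaptation idea concrete, below is a minimal, hypothetical Python sketch: the target-length budget is derived from how much time the source speech actually consumed, so fast speech yields shorter translations and slow speech yields longer, pause-free ones. The function name, the linear scaling rule, and all constants are illustrative assumptions; SAT learns this behavior through self-adaptive training rather than applying a hand-written rule.

```python
def target_length_budget(src_tokens: int,
                         src_duration_s: float,
                         tgt_speech_rate: float = 3.5,
                         ratio_bounds: tuple = (0.7, 1.3)) -> int:
    """Estimate how many target tokens fit the time the source occupied.

    src_tokens      -- number of source tokens heard so far
    src_duration_s  -- wall-clock seconds the source speech consumed
    tgt_speech_rate -- assumed TTS speaking rate in tokens per second
    ratio_bounds    -- clamp on the target/source length ratio
    """
    # Tokens the target TTS can utter in the time the source consumed.
    budget = tgt_speech_rate * src_duration_s
    # Keep the translation within a plausible length ratio of the source
    # so timing is never bought at an extreme cost to adequacy (BLEU).
    lo, hi = ratio_bounds
    return int(min(max(budget, lo * src_tokens), hi * src_tokens))

# A fast speaker (20 tokens in 4 s) gets a short budget; a slow speaker
# (20 tokens in 8 s) gets a longer one.
print(target_length_budget(20, 4.0))  # 14
print(target_length_budget(20, 8.0))  # 26
```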
Related papers
- StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning [48.84039953531356]
StreamSpeech is a direct Simul-S2ST model that jointly learns translation and simultaneous policy.
Experiments on the CVSS benchmark demonstrate that StreamSpeech achieves state-of-the-art performance in both offline S2ST and Simul-S2ST tasks.
arXiv Detail & Related papers (2024-06-05T08:24:22Z)
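As a rough illustration of the multi-task setup in the StreamSpeech entry above, this sketch combines per-task losses that would share one speech encoder. The task names and loss weights are assumptions for illustration, not StreamSpeech's actual heads or hyperparameters.

```python
import torch

def multitask_loss(losses: dict, weights: dict) -> torch.Tensor:
    """Weighted sum of per-task losses sharing one speech encoder."""
    return sum(weights[name] * loss for name, loss in losses.items())

losses = {
    "asr": torch.tensor(1.2),     # source transcription head
    "s2tt": torch.tensor(0.9),    # speech-to-text translation head
    "unit": torch.tensor(2.1),    # target speech-unit generation head
    "policy": torch.tensor(0.3),  # learned read/write (when-to-speak) head
}
weights = {"asr": 1.0, "s2tt": 1.0, "unit": 1.0, "policy": 0.5}
print(multitask_loss(losses, weights))  # tensor(4.3500)
```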
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
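As a rough illustration of the two-encoder conditioning described in the TransVIP entry above, the sketch below pools a voice embedding and a timing embedding that would jointly condition a decoder. Layer sizes, pooling, and the concatenation scheme are illustrative assumptions, not TransVIP's architecture.

```python
import torch
import torch.nn as nn

class DualConditioner(nn.Module):
    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        # Voice encoder: pools source mel frames into a speaker embedding.
        self.voice_enc = nn.Sequential(nn.Linear(n_mels, dim), nn.ReLU())
        # Isochrony encoder: summarizes per-frame timing information
        # (e.g., voiced/paused flags) into a timing embedding.
        self.timing_enc = nn.Sequential(nn.Linear(1, dim), nn.ReLU())

    def forward(self, mels: torch.Tensor, voiced: torch.Tensor) -> torch.Tensor:
        v = self.voice_enc(mels).mean(dim=1)     # (batch, dim)
        t = self.timing_enc(voiced).mean(dim=1)  # (batch, dim)
        # Both embeddings would condition the translation decoder.
        return torch.cat([v, t], dim=-1)         # (batch, 2 * dim)

cond = DualConditioner()
mels = torch.randn(2, 120, 80)   # batch of mel-spectrogram frames
voiced = torch.rand(2, 120, 1)   # per-frame voicing flags
print(cond(mels, voiced).shape)  # torch.Size([2, 512])
```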
- TranSentence: Speech-to-speech Translation via Language-agnostic Sentence-level Speech Encoding without Language-parallel Data [44.83532231917504]
TranSentence is a novel speech-to-speech translation approach that requires no language-parallel speech data.
We train our model to generate speech based on the encoded embedding obtained from a language-agnostic sentence-level speech encoder.
We extend TranSentence to multilingual speech-to-speech translation.
arXiv Detail & Related papers (2024-01-17T11:52:40Z)
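A hedged sketch of the TranSentence-style pipeline above: source speech is mapped to a language-agnostic sentence embedding, and a generator trained only on target-language data synthesizes speech from it. Both components below are stand-in stubs, not the paper's learned models.

```python
import numpy as np

def sentence_encoder(speech: np.ndarray) -> np.ndarray:
    """Stand-in for a language-agnostic sentence-level speech encoder."""
    vec = speech.mean(axis=0)                  # crude pooling placeholder
    return vec / (np.linalg.norm(vec) + 1e-8)  # unit-normalized embedding

def speech_generator(embedding: np.ndarray, n_frames: int = 100) -> np.ndarray:
    """Stand-in for a decoder trained to reconstruct target speech."""
    return np.tile(embedding, (n_frames, 1))   # placeholder synthesis

# Because the embedding space is shared across languages, a generator
# trained only on target-language (embedding, speech) pairs can accept
# an embedding computed from source-language speech at test time.
src_speech = np.random.randn(200, 80)          # source-language features
tgt_speech = speech_generator(sentence_encoder(src_speech))
print(tgt_speech.shape)                        # (100, 80)
```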
- Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing [71.02335065794384]
We propose a model that directly optimizes both the translation and the speech duration of the generated translations.
We show that this system generates speech that better matches the timing of the original speech, compared to prior work, while simplifying the system architecture.
arXiv Detail & Related papers (2023-02-25T04:23:25Z)
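To illustrate the quality-versus-timing trade-off in the dubbing entry above, the sketch below rescores candidate translations by combining an MT log-probability with a duration-mismatch penalty. The candidates, durations, and weight `alpha` are made-up values, and the paper optimizes translation and duration jointly in one model rather than by rescoring.

```python
def isochrony_score(logprob: float, pred_dur: float, src_dur: float,
                    alpha: float = 2.0) -> float:
    """Higher is better: MT score minus a duration-mismatch penalty."""
    return logprob - alpha * abs(pred_dur - src_dur) / src_dur

candidates = [
    ("short paraphrase", -1.2, 2.1),   # (text, logprob, predicted seconds)
    ("literal rendering", -0.8, 4.4),
    ("padded rendering", -1.5, 2.9),
]
src_duration = 2.8  # seconds the original line occupies on screen

best = max(candidates, key=lambda c: isochrony_score(c[1], c[2], src_duration))
print(best[0])  # "padded rendering": best balance of MT score and timing
```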
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech significantly improves inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
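A rough sketch of the mask-predict style non-autoregressive decoding named in the TranSpeech entry above: all target units are emitted at once, then the least-confident ones are iteratively re-masked and re-predicted. The `model_predict` stub and the masking schedule are illustrative assumptions, not the paper's learned unit decoder.

```python
import random

def model_predict(units, masked_positions):
    """Stand-in: fill masked slots with random units and confidences."""
    out = list(units)
    conf = [1.0] * len(units)
    for i in masked_positions:
        out[i] = random.randrange(100)  # pretend vocabulary of 100 units
        conf[i] = random.random()       # pretend model confidence
    return out, conf

def mask_predict(length: int = 10, iterations: int = 4):
    # First pass: predict every position in parallel.
    units, conf = model_predict([None] * length, range(length))
    for t in range(1, iterations):
        # Linearly decay how many low-confidence units get re-masked.
        n_mask = length * (iterations - t) // iterations
        remask = sorted(range(length), key=lambda i: conf[i])[:n_mask]
        units, conf = model_predict(units, remask)
    return units

print(mask_predict())  # e.g. [17, 3, 88, ...] after 4 refinement passes
```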
- Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram [21.652906261475533]
Cross-lingual voice conversion is a challenging problem due to significant mismatches of the phonetic set and the speech prosody of different languages.
We build upon a neural text-to-speech (TTS) model to design a new cross-lingual VC framework named FastSpeech-VC.
arXiv Detail & Related papers (2021-02-03T10:28:07Z)
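To illustrate the phonetic posteriorgram (PPG) idea in the entry above: a speaker-independent recognizer turns each speech frame into a distribution over phonetic classes, which a TTS-style decoder could then re-synthesize in a target speaker's voice. The extractor below is a stand-in stub, not FastSpeech-VC's acoustic model.

```python
import numpy as np

def extract_ppg(frames: np.ndarray, n_phones: int = 40) -> np.ndarray:
    """Stand-in acoustic model: frame-wise phonetic class posteriors."""
    logits = frames @ np.random.randn(frames.shape[1], n_phones)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # each row sums to 1

src_frames = np.random.randn(150, 80)        # source-language speech
ppg = extract_ppg(src_frames)
# The PPG carries "what was said" but (ideally) not "who said it", so a
# decoder conditioned on a target speaker embedding can speak it in the
# target voice, even across languages.
print(ppg.shape, round(float(ppg[0].sum()), 3))  # (150, 40) 1.0
```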
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model that aims to improve end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
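As a loose illustration of the speech-to-text adaptation idea in the entry above, the sketch below trains a small adapter to pull pooled speech representations toward their paired text representations. The adapter shape and the L2 objective are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn as nn

adapter = nn.Linear(512, 512)      # maps speech space toward text space
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)

speech_repr = torch.randn(8, 512)  # pooled speech-encoder states
text_repr = torch.randn(8, 512)    # paired text-encoder states

for step in range(100):
    opt.zero_grad()
    # Pull adapted speech representations toward their paired text
    # representations so the downstream decoder sees familiar input.
    loss = ((adapter(speech_repr) - text_repr) ** 2).mean()
    loss.backward()
    opt.step()

print(f"final adaptation loss: {loss.item():.4f}")
```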
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.