Duplex Diffusion Models Improve Speech-to-Speech Translation
- URL: http://arxiv.org/abs/2305.12628v1
- Date: Mon, 22 May 2023 01:39:40 GMT
- Title: Duplex Diffusion Models Improve Speech-to-Speech Translation
- Authors: Xianchao Wu
- Abstract summary: Speech-to-speech translation is a sequence-to-sequence learning task that naturally has two directions.
We propose a duplex diffusion model that applies diffusion probabilistic models to both sides of a reversible duplex Conformer.
Our model enables reversible speech translation by simply flipping the input and output ends.
- Score: 1.4649095013539173
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Speech-to-speech translation is a typical sequence-to-sequence learning task
that naturally has two directions. How can bidirectional supervision signals be
effectively leveraged to produce high-fidelity audio in both directions? Existing
approaches either train two separate models or a single multitask-learned model,
which suffers from low efficiency and inferior performance. In this paper, we propose a
duplex diffusion model that applies diffusion probabilistic models to both
sides of a reversible duplex Conformer, so that either end can simultaneously
input and output a distinct language's speech. Our model enables reversible
speech translation by simply flipping the input and output ends. Experiments
show that our model achieves the first successful reversible speech translation,
with significant ASR-BLEU improvements over a set of state-of-the-art baselines.
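The key mechanism, a single reversible backbone whose two ends each carry one language's speech so that flipping the ends reverses the translation direction, can be illustrated with a small additive-coupling (RevNet-style) sketch. This is a minimal illustration under assumptions: `FeedForwardBlock` stands in for the Conformer-style sub-networks, and the per-language diffusion models that the paper attaches to each end are omitted.

```python
# Minimal sketch of a reversible "duplex" backbone built from additive
# coupling blocks. Illustrative only: block internals, dimensions, and the
# omitted per-language diffusion heads are assumptions, not the paper's model.
import torch
import torch.nn as nn


class FeedForwardBlock(nn.Module):
    """Stand-in for a Conformer-style sub-network (assumed, simplified)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 2 * dim),
                                 nn.GELU(), nn.Linear(2 * dim, dim))

    def forward(self, x):
        return self.net(x)


class ReversibleBlock(nn.Module):
    """Additive coupling: exactly invertible, so one set of weights serves
    both translation directions."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = FeedForwardBlock(dim)
        self.g = FeedForwardBlock(dim)

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2


class DuplexBackbone(nn.Module):
    """Running the stack forward maps language-A features toward language B;
    running it in inverse maps B back to A ("flipping the ends")."""
    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([ReversibleBlock(dim) for _ in range(depth)])

    def a_to_b(self, x1, x2):
        for blk in self.blocks:
            x1, x2 = blk(x1, x2)
        return x1, x2

    def b_to_a(self, y1, y2):
        for blk in reversed(self.blocks):
            y1, y2 = blk.inverse(y1, y2)
        return y1, y2


if __name__ == "__main__":
    net = DuplexBackbone()
    a1, a2 = torch.randn(2, 100, 256), torch.randn(2, 100, 256)
    b1, b2 = net.a_to_b(a1, a2)
    r1, r2 = net.b_to_a(b1, b2)
    # Inversion recovers the inputs up to floating-point error.
    print(torch.allclose(a1, r1, atol=1e-4), torch.allclose(a2, r2, atol=1e-4))
```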
Related papers
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z) - TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascaded fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
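The "two separate encoders" idea can be pictured as a translation decoder that receives three conditioning streams: linguistic content, speaker timbre, and timing/isochrony. The sketch below is a hypothetical scaffold with placeholder callables, not TransVIP's released components.

```python
# Hypothetical scaffold: condition translation on content plus two auxiliary
# encodings, one for speaker timbre and one for pause/duration structure.
def translate_preserving_voice(content_encoder, speaker_encoder,
                               isochrony_encoder, decoder, src_wav):
    content = content_encoder(src_wav)      # what is being said
    speaker = speaker_encoder(src_wav)      # how the speaker sounds
    timing = isochrony_encoder(src_wav)     # pause and duration structure
    # The decoder sees all three, so the translated speech can keep the
    # original voice and pacing.
    return decoder(content, speaker=speaker, timing=timing)
```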
arXiv Detail & Related papers (2024-05-28T04:11:37Z) - Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability based on discrete self-supervised speech representations and timbre units.
arXiv Detail & Related papers (2023-09-14T09:52:08Z) - Minimally-Supervised Speech Synthesis with Conditional Diffusion Model
and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to address high-dimensionality and waveform-distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z) - DiffVoice: Text-to-Speech with Latent Diffusion [18.150627638754923]
We present DiffVoice, a novel text-to-speech model based on latent diffusion.
Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate that our method beats the best publicly available systems in naturalness.
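The latent-diffusion recipe referenced here follows a standard pattern: compress audio into a latent, corrupt it with Gaussian noise at a random timestep, and train a network to predict that noise given the text condition. Below is a generic, hedged sketch of one training step; `audio_encoder`, `text_encoder`, and `denoiser` are assumed placeholders, not DiffVoice's actual modules.

```python
# Generic DDPM-style latent-diffusion training step (noise prediction).
import torch
import torch.nn.functional as F


def latent_diffusion_loss(audio_encoder, text_encoder, denoiser,
                          wav, text_tokens, alphas_cumprod):
    # 1) Compress the waveform (or mel) into a latent sequence z_0.
    z0 = audio_encoder(wav)                        # (B, T, D)

    # 2) Sample a timestep and look up its cumulative noise level.
    B = z0.size(0)
    t = torch.randint(0, alphas_cumprod.numel(), (B,), device=z0.device)
    a_bar = alphas_cumprod.to(z0.device)[t].view(B, 1, 1)

    # 3) Forward process: z_t = sqrt(a_bar) * z_0 + sqrt(1 - a_bar) * eps
    eps = torch.randn_like(z0)
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps

    # 4) Predict the injected noise, conditioned on text and timestep.
    cond = text_encoder(text_tokens)               # (B, S, D)
    eps_hat = denoiser(zt, t, cond)

    # 5) Simple (unweighted) denoising objective.
    return F.mse_loss(eps_hat, eps)
```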
arXiv Detail & Related papers (2023-04-23T21:05:33Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling up to a 21.4x speedup over the autoregressive technique.
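The "repeatedly masks and predicts unit choices" step is mask-predict-style iterative refinement: start from a fully masked unit sequence, fill in every position in parallel, then re-mask the least confident positions and repeat. The sketch below is a generic version under assumptions (the `model` interface, a known target length, and a linear re-masking schedule), not TranSpeech's exact decoder.

```python
# Generic mask-predict decoding over discrete target speech units.
import torch


@torch.no_grad()
def mask_predict_decode(model, src_feats, tgt_len, mask_id, num_iters=10):
    device = src_feats.device
    units = torch.full((1, tgt_len), mask_id, dtype=torch.long, device=device)

    for it in range(num_iters):
        # Predict all target units in parallel, conditioned on the source.
        logits = model(src_feats, units)              # (1, tgt_len, V)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        units = pred

        # Re-mask the least confident positions; fewer each iteration.
        n_mask = int(tgt_len * (1.0 - (it + 1) / num_iters))
        if n_mask == 0:
            break
        lowest = conf.squeeze(0).argsort()[:n_mask]
        units[0, lowest] = mask_id

    return units.squeeze(0)
```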
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
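At a high level the pipeline is: encode source speech, decode target-language discrete units (rather than text or spectrograms), and run a unit vocoder to produce the waveform, optionally emitting text from the same encoder states. The function below is a placeholder scaffold for that flow, not the paper's released components.

```python
# Speech-to-unit translation followed by a unit vocoder (placeholder scaffold).
def s2st_with_discrete_units(speech_encoder, unit_decoder, unit_vocoder,
                             src_wav, text_head=None):
    enc = speech_encoder(src_wav)        # source speech -> hidden states
    units = unit_decoder(enc)            # target discrete units (e.g. clustered
                                         # self-supervised codes)
    tgt_wav = unit_vocoder(units)        # units -> target waveform
    # Optional multitask text output produced in the same pass ("dual mode").
    text = text_head(enc) if text_head is not None else None
    return tgt_wav, text
```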
arXiv Detail & Related papers (2021-07-12T17:40:43Z) - Tight Integrated End-to-End Training for Cascaded Speech Translation [40.76367623739673]
A cascaded speech translation model relies on discrete and non-differentiable transcription.
Direct speech translation is an alternative method to avoid error propagation.
This work explores the feasibility of collapsing the entire cascade components into a single end-to-end trainable model.
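One standard way to make a cascade end-to-end trainable is to avoid the hard, argmax transcript and instead feed the MT component an expectation over its input embeddings under the ASR posterior, so gradients flow through the transcription step. The snippet below shows that generic trick; whether it matches this paper's exact coupling (e.g., its posterior renormalization) is not stated in the summary, so treat it as an assumption.

```python
# Soft (differentiable) coupling between ASR and MT in a cascade.
import torch


def soft_transcript_embeddings(asr_logits, mt_embedding, temperature=1.0):
    """asr_logits: (B, T, V) over the MT source vocabulary.
    mt_embedding: torch.nn.Embedding used by the MT encoder."""
    # Hard cascade (non-differentiable w.r.t. the ASR model):
    #   tokens = asr_logits.argmax(-1); emb = mt_embedding(tokens)
    # Soft cascade: expected embedding under the ASR posterior.
    posterior = torch.softmax(asr_logits / temperature, dim=-1)   # (B, T, V)
    return posterior @ mt_embedding.weight                        # (B, T, D)
```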
arXiv Detail & Related papers (2020-11-24T15:43:49Z) - Two-Way Neural Machine Translation: A Proof of Concept for Bidirectional Translation Modeling using a Two-Dimensional Grid [47.39346022004215]
This paper proposes to build a single end-to-end bidirectional translation model using a two-dimensional grid.
Instead of training two models independently, our approach encourages a single network to jointly learn to translate in both directions.
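A two-dimensional grid here means a lattice of states indexed by (source position i, target position j), where each cell depends on its left and top neighbours plus the two tokens, so the same grid supports reading out translations in either direction. The toy recurrence below is a conceptual sketch only; the paper's actual cell type, readout, and direction-specific masking differ.

```python
# Toy 2D-grid recurrence over (source position i, target position j).
import torch
import torch.nn as nn


class Grid2DCell(nn.Module):
    def __init__(self, emb_dim: int, hid_dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * emb_dim + 2 * hid_dim, hid_dim)

    def forward(self, x_i, y_j, h_left, h_up):
        return torch.tanh(self.proj(torch.cat([x_i, y_j, h_left, h_up], dim=-1)))


def grid_states(cell, src_emb, tgt_emb, hid_dim):
    """src_emb: (I, E), tgt_emb: (J, E) -> states[i][j] summarizing
    the source prefix up to i and the target prefix up to j."""
    zero = src_emb.new_zeros(hid_dim)
    prev_row, rows = [zero] * tgt_emb.size(0), []
    for i in range(src_emb.size(0)):
        row, left = [], zero
        for j in range(tgt_emb.size(0)):
            h = cell(src_emb[i], tgt_emb[j], left, prev_row[j])
            row.append(h)
            left = h
        rows.append(row)
        prev_row = row
    # A direction-specific output layer (omitted here) would predict the next
    # source or next target token from these states.
    return rows
```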
arXiv Detail & Related papers (2020-11-24T15:42:32Z)