DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct
Speech-to-Speech Translation
- URL: http://arxiv.org/abs/2310.17570v1
- Date: Thu, 26 Oct 2023 16:58:14 GMT
- Title: DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct
Speech-to-Speech Translation
- Authors: Yongxin Zhu, Zhujin Gao, Xinyuan Zhou, Zhongyi Ye, Linli Xu
- Abstract summary: We propose a novel diffusion model by applying the diffusion forward process in the continuous speech representation space.
In this way, we preserve the semantic structure of the continuous speech representation space in the diffusion process and integrate the continuous and discrete diffusion models.
We conduct extensive experiments on the textless direct speech-to-speech translation task, where the proposed method achieves comparable results to the computationally intensive auto-regressive baselines.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Diffusion Generative Models have achieved great success on image
generation tasks, how to efficiently and effectively incorporate them into
speech generation especially translation tasks remains a non-trivial problem.
Specifically, due to the low information density of speech data, the
transformed discrete speech unit sequence is much longer than the corresponding
text transcription, posing significant challenges to existing auto-regressive
models. Furthermore, naively applying discrete diffusion to the speech unit
sequence while disregarding the structure of the continuous space is
suboptimal and degrades generation performance significantly. In this paper, we
propose a novel diffusion model by applying the diffusion forward process in
the \textit{continuous} speech representation space, while employing the
diffusion backward process in the \textit{discrete} speech unit space. In this
way, we preserve the semantic structure of the continuous speech representation
space in the diffusion process and integrate the continuous and discrete
diffusion models. We conduct extensive experiments on the textless direct
speech-to-speech translation task, where the proposed method achieves
comparable results to the computationally intensive auto-regressive baselines
(500 steps on average) with significantly fewer decoding steps (50 steps).
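The forward-in-continuous / backward-in-discrete scheme described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under standard DDPM-style assumptions, not the paper's implementation: the function names, the noise schedule, and the nearest-codebook dummy denoiser are all hypothetical stand-ins.

```python
import numpy as np

def forward_diffuse(embeddings, t, alphas_cumprod, rng):
    """Forward process in the *continuous* embedding space:
    q(x_t | x_0) = N(sqrt(a_t) * x_0, (1 - a_t) * I)."""
    a_t = alphas_cumprod[t]
    noise = rng.standard_normal(embeddings.shape)
    return np.sqrt(a_t) * embeddings + np.sqrt(1.0 - a_t) * noise

def backward_step(x_t, codebook, denoise_fn, t, alphas_cumprod, rng):
    """One backward step in the *discrete* unit space: the denoiser
    predicts unit logits, we commit to a codebook entry, re-embed it,
    and (if t > 0) re-noise the clean estimate to level t - 1."""
    logits = denoise_fn(x_t, t)            # (seq_len, num_units)
    units = logits.argmax(axis=-1)         # discrete speech units
    x0_hat = codebook[units]               # back into continuous space
    if t == 0:
        return x0_hat, units
    return forward_diffuse(x0_hat, t - 1, alphas_cumprod, rng), units

# Hypothetical usage with a toy codebook and a nearest-codebook "denoiser"
rng = np.random.default_rng(0)
alphas_cumprod = np.linspace(0.99, 0.9, 50).cumprod()  # assumed schedule
codebook = rng.standard_normal((100, 16))              # 100 units, dim 16
x0 = codebook[rng.integers(0, 100, size=8)]            # 8-unit sequence
x_t = forward_diffuse(x0, 49, alphas_cumprod, rng)

def dummy_denoise(x, t):
    # Toy denoiser: logits = negative squared distance to each codebook entry
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return -d

x_prev, units = backward_step(x_t, codebook, dummy_denoise, 49,
                              alphas_cumprod, rng)
```

Because each backward step snaps the estimate onto the codebook, the semantic structure of the continuous space constrains every intermediate state, which is the property the abstract highlights.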
Related papers
- Discrete Diffusion Language Model for Long Text Summarization [19.267738861590487]
We introduce a novel semantic-aware noising process that enables Transformer backbones to handle long sequences effectively.
Our approaches achieve state-of-the-art performance on three benchmark summarization datasets: Gigaword, CNN/DailyMail, and Arxiv.
arXiv Detail & Related papers (2024-06-25T09:55:22Z)
- Text Diffusion with Reinforced Conditioning [92.17397504834825]
This paper thoroughly analyzes text diffusion models and uncovers two significant limitations: degradation of self-conditioning during training and misalignment between training and sampling.
Motivated by our findings, we propose a novel Text Diffusion model called TREC, which mitigates the degradation with Reinforced Conditioning and the misalignment by Time-Aware Variance Scaling.
arXiv Detail & Related papers (2024-02-19T09:24:02Z)
- Investigating the Design Space of Diffusion Models for Speech Enhancement [17.914763947871368]
Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature.
We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals.
We also show that a proper choice of preconditioning, training loss weighting, SDE and sampler makes it possible to outperform a popular diffusion-based speech enhancement system.
arXiv Detail & Related papers (2023-12-07T15:40:55Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- TESS: Text-to-Text Self-Conditioned Simplex Diffusion [56.881170312435444]
Text-to-text Self-conditioned Simplex Diffusion employs a new form of self-conditioning, and applies the diffusion process on the logit simplex space rather than the learned embedding space.
We demonstrate that TESS outperforms state-of-the-art non-autoregressive models, requires fewer diffusion steps with minimal drop in performance, and is competitive with pretrained autoregressive sequence-to-sequence models.
arXiv Detail & Related papers (2023-05-15T06:33:45Z)
- DiffVoice: Text-to-Speech with Latent Diffusion [18.150627638754923]
We present DiffVoice, a novel text-to-speech model based on latent diffusion.
Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate that our method beats the best publicly available systems in naturalness.
arXiv Detail & Related papers (2023-04-23T21:05:33Z)
- A Cheaper and Better Diffusion Language Model with Soft-Masked Noise [62.719656543880596]
Masked-Diffuse LM is a novel diffusion model for language modeling, inspired by linguistic features.
Specifically, we design a linguistically informed forward process which corrupts the text through strategic soft-masking to better noise the textual data.
We demonstrate that our Masked-Diffuse LM can achieve better generation quality than the state-of-the-art diffusion models with better efficiency.
arXiv Detail & Related papers (2023-04-10T17:58:42Z)
- SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers [50.90457644954857]
In this work, we apply diffusion models to approach sequence-to-sequence text generation.
We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation.
Experimental results illustrate strong performance on sequence-to-sequence generation in terms of text quality and inference time.
arXiv Detail & Related papers (2022-12-20T15:16:24Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
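The "repeatedly masks and predicts unit choices" decoding mentioned for TranSpeech is the general mask-predict idea from non-autoregressive generation. A minimal sketch of that loop, assuming a hypothetical `predict_fn` that returns per-position unit probabilities (this is an illustration of the generic CMLM-style scheme, not TranSpeech's actual code):

```python
import numpy as np

def mask_predict_decode(predict_fn, seq_len, num_units, iterations,
                        mask_id=-1):
    """Iterative mask-predict decoding: start from a fully masked
    sequence, then repeatedly predict all positions and re-mask the
    least confident ones, unmasking more units each round."""
    units = np.full(seq_len, mask_id)
    for it in range(iterations):
        probs = predict_fn(units)          # (seq_len, num_units)
        units = probs.argmax(axis=-1)      # fill every position
        conf = probs.max(axis=-1)          # per-position confidence
        # Linearly shrink the number of re-masked positions to zero
        n_mask = int(seq_len * (1 - (it + 1) / iterations))
        if n_mask > 0:
            units[np.argsort(conf)[:n_mask]] = mask_id
    return units

# Hypothetical usage: a toy predictor with fixed per-position probabilities
rng = np.random.default_rng(1)
logits = rng.standard_normal((12, 20))
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
decoded = mask_predict_decode(lambda u: probs, seq_len=12,
                              num_units=20, iterations=4)
```

All positions are produced in parallel each round, which is where the large latency win over step-by-step autoregressive decoding comes from.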
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.