TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation
- URL: http://arxiv.org/abs/2205.12523v1
- Date: Wed, 25 May 2022 06:34:14 GMT
- Title: TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation
- Authors: Rongjie Huang, Zhou Zhao, Jinglin Liu, Huadai Liu, Yi Ren, Lichao
Zhang, Jinzheng He
- Abstract summary: TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
- Score: 61.564874831498145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Direct speech-to-speech translation (S2ST) systems leverage recent progress
in speech representation learning, in which a sequence of discrete representations
(units), derived in a self-supervised manner, is predicted by the model and passed
to a vocoder for speech synthesis. These systems still face the following challenges:
1) acoustic multimodality: discrete units derived from speech with the same content
can be non-deterministic due to acoustic properties (e.g., rhythm, pitch, and energy),
which degrades translation accuracy; 2) high latency: current S2ST systems use
autoregressive models that predict each unit conditioned on the previously generated
sequence, failing to take full advantage of parallelism. In this work, we propose
TranSpeech, a speech-to-speech translation model with bilateral perturbation. To
alleviate the acoustic multimodality problem, bilateral perturbation consists of
style normalization and information enhancement stages that learn only the linguistic
information from speech samples and generate more deterministic representations. With
reduced multimodality, we go a step further and establish the first non-autoregressive
S2ST technique, which repeatedly masks and predicts unit choices and produces
high-accuracy results in just a few refinement cycles. Experimental results on three
language pairs demonstrate state-of-the-art performance, with gains of up to 2.5 BLEU
points over the best publicly available textless S2ST baseline. Moreover, TranSpeech
achieves a significant improvement in inference latency, with speedups of up to 21.4x
over the autoregressive technique. Audio samples are available at
\url{https://TranSpeech.github.io/}
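
The non-autoregressive decoding described in the abstract ("repeatedly masks and predicts unit choices") is essentially a mask-predict scheme over discrete units. The sketch below is a minimal, hypothetical illustration of that loop; it is not the authors' released code, and the decoder, mask token id, unit vocabulary size, target length, and cycle count are assumptions made for the example.

```python
# Minimal sketch of iterative mask-predict decoding over discrete speech units.
# Illustrative only: `model`, MASK_ID, VOCAB, and the number of cycles are
# assumptions. `model` is any callable returning per-position unit logits
# of shape (tgt_len, VOCAB).
import torch

MASK_ID = 0     # assumed id reserved for the mask token
VOCAB = 1000    # assumed size of the discrete unit vocabulary

def mask_predict(model, src, tgt_len, cycles=4):
    units = torch.full((tgt_len,), MASK_ID, dtype=torch.long)  # start fully masked
    scores = torch.zeros(tgt_len)
    for t in range(1, cycles + 1):
        logits = model(src, units)                      # predict every position in parallel
        new_scores, new_units = logits.softmax(-1).max(dim=-1)
        masked = units.eq(MASK_ID)
        units = torch.where(masked, new_units, units)   # only masked positions are updated
        scores = torch.where(masked, new_scores, scores)
        n_mask = int(tgt_len * (cycles - t) / cycles)   # masked fraction shrinks each cycle
        if n_mask > 0:
            worst = scores.topk(n_mask, largest=False).indices
            units[worst] = MASK_ID                      # re-mask the least confident units
            scores[worst] = 0.0
    return units                                        # final units go to a unit vocoder

# Toy usage with a dummy decoder that returns random logits.
dummy = lambda src, units: torch.randn(units.shape[0], VOCAB)
print(mask_predict(dummy, src=None, tgt_len=20))
```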
Related papers
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z) - TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z) - DASpeech: Directed Acyclic Transformer for Fast and High-quality
Speech-to-Speech Translation [36.126810842258706]
Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model.
Due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution.
We propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST.
arXiv Detail & Related papers (2023-10-11T11:39:36Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Duplex Diffusion Models Improve Speech-to-Speech Translation [1.4649095013539173]
Speech-to-speech translation is a sequence-to-sequence learning task that naturally has two directions.
We propose a duplex diffusion model that applies diffusion probabilistic models to both sides of a reversible duplex Conformer.
Our model enables reversible speech translation by simply flipping the input and output ends.
arXiv Detail & Related papers (2023-05-22T01:39:40Z) - NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot
Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain the quantized latent vectors (see the residual vector quantization sketch after this list).
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z) - Textless Direct Speech-to-Speech Translation with Discrete Speech
Representation [27.182170555234226]
We propose a novel model, Textless Translatotron, for training an end-to-end direct S2ST model without any textual supervision.
When a speech encoder pre-trained with unsupervised speech data is used for both models, the proposed model obtains translation quality nearly on-par with Translatotron 2.
arXiv Detail & Related papers (2022-10-31T19:48:38Z) - Enhanced Direct Speech-to-Speech Translation Using Self-supervised
Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z) - WavThruVec: Latent speech representation as intermediate features for
neural speech synthesis [1.1470070927586016]
WavThruVec is a two-stage architecture that resolves the bottleneck by using high-dimensional Wav2Vec 2.0 embeddings as an intermediate speech representation.
We show that the proposed model not only matches the quality of state-of-the-art neural models, but also presents useful properties enabling tasks like voice conversion or zero-shot synthesis.
arXiv Detail & Related papers (2022-03-31T10:21:08Z)