DASpeech: Directed Acyclic Transformer for Fast and High-quality
Speech-to-Speech Translation
- URL: http://arxiv.org/abs/2310.07403v1
- Date: Wed, 11 Oct 2023 11:39:36 GMT
- Title: DASpeech: Directed Acyclic Transformer for Fast and High-quality
Speech-to-Speech Translation
- Authors: Qingkai Fang, Yan Zhou, Yang Feng
- Abstract summary: Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model.
Due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution.
We propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST.
- Score: 36.126810842258706
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Direct speech-to-speech translation (S2ST) translates speech from one
language into another using a single model. However, due to the presence of
linguistic and acoustic diversity, the target speech follows a complex
multimodal distribution, posing challenges to achieving both high-quality
translations and fast decoding speeds for S2ST models. In this paper, we
propose DASpeech, a non-autoregressive direct S2ST model which realizes both
fast and high-quality S2ST. To better capture the complex distribution of the
target speech, DASpeech adopts the two-pass architecture to decompose the
generation process into two steps, where a linguistic decoder first generates
the target text, and an acoustic decoder then generates the target speech based
on the hidden states of the linguistic decoder. Specifically, we use the
decoder of DA-Transformer as the linguistic decoder, and use FastSpeech 2 as
the acoustic decoder. DA-Transformer models translations with a directed
acyclic graph (DAG). To consider all potential paths in the DAG during
training, we calculate the expected hidden states for each target token via
dynamic programming, and feed them into the acoustic decoder to predict the
target mel-spectrogram. During inference, we select the most probable path and
take hidden states on that path as input to the acoustic decoder. Experiments
on the CVSS Fr-En benchmark demonstrate that DASpeech can achieve comparable or
even better performance than the state-of-the-art S2ST model Translatotron 2,
while achieving up to 18.53x speedup compared to the autoregressive baseline.
Compared with the previous non-autoregressive S2ST model, DASpeech does not
rely on knowledge distillation and iterative decoding, achieving significant
improvements in both translation quality and decoding speed. Furthermore,
DASpeech shows the ability to preserve the source speaker's voice during
translation.
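The abstract describes two mechanisms: during training, the expected hidden state for each target token is computed by dynamic programming over all paths of the DA-Transformer DAG and fed to the acoustic decoder; during inference, a single high-probability path is selected and its hidden states are used instead. The sketch below is a minimal, illustrative toy version of these two steps, not the authors' implementation: random NumPy arrays stand in for the per-vertex hidden states, transition matrix, and token distributions produced by the real linguistic decoder, and the greedy path selection is a simplification of DA-Transformer decoding. Function names such as `expected_hidden_states` and `most_probable_path` are placeholders chosen for this example.

```python
# Toy sketch of DASpeech-style training/inference inputs to the acoustic decoder.
# Assumptions: random stand-ins for the DA-Transformer outputs; greedy path search.
import numpy as np

rng = np.random.default_rng(0)
L, T, V, d = 8, 4, 10, 16          # DAG vertices, target tokens, vocab size, hidden dim

H = rng.normal(size=(L, d))                        # per-vertex hidden states
E = np.triu(rng.random((L, L)), k=1)               # transitions only to later vertices
E /= np.clip(E.sum(axis=1, keepdims=True), 1e-9, None)
P_tok = rng.random((L, V))
P_tok /= P_tok.sum(axis=1, keepdims=True)          # per-vertex token distribution
y = rng.integers(0, V, size=T)                     # a reference target sequence

def expected_hidden_states(H, E, P_tok, y):
    """Posterior-weighted hidden state per target position (training-time input
    to the acoustic decoder), marginalizing over all DAG paths."""
    L, T = H.shape[0], len(y)
    # forward: alpha[t, i] = P(y_1..t, token t emitted at vertex i)
    alpha = np.zeros((T, L))
    alpha[0, 0] = P_tok[0, y[0]]                   # paths start at the first vertex
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ E) * P_tok[:, y[t]]
    # backward: beta[t, i] = P(y_{t+1}..T | token t emitted at vertex i)
    beta = np.zeros((T, L))
    beta[T - 1, L - 1] = 1.0                       # paths end at the last vertex
    for t in range(T - 2, -1, -1):
        beta[t] = E @ (P_tok[:, y[t + 1]] * beta[t + 1])
    gamma = alpha * beta                           # unnormalized posterior P(a_t = i | y)
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma @ H                               # (T, d) expected hidden states

def most_probable_path(E, P_tok, max_len):
    """Greedy path selection for inference (a simplification of DA-Transformer
    decoding): at each vertex, follow the best transition-times-emission score."""
    path = [0]
    while path[-1] != E.shape[0] - 1 and len(path) < max_len:
        scores = E[path[-1]] * P_tok.max(axis=1)
        path.append(int(scores.argmax()))
    return path

train_inputs = expected_hidden_states(H, E, P_tok, y)  # acoustic-decoder input (training)
path = most_probable_path(E, P_tok, max_len=L)
infer_inputs = H[path]                                  # acoustic-decoder input (inference)
print(train_inputs.shape, path, infer_inputs.shape)
```

In the actual model the acoustic decoder (FastSpeech 2) maps these hidden states to a mel-spectrogram; the sketch stops at producing its inputs, since that is the part the DAG and dynamic programming determine.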
Related papers
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
- RepCodec: A Speech Representation Codec for Speech Tokenization [21.60885344868044]
RepCodec is a novel speech representation codec for semantic speech tokenization.
We show that RepCodec significantly outperforms the widely used k-means clustering approach in both speech understanding and generation.
arXiv Detail & Related papers (2023-08-31T23:26:10Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and subsequently predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost the performance even when predicting spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, achieving up to 21.4x speedup over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can reduce the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation [71.54816893482457]
We introduce dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST).
Our models are based on the original Transformer architecture but consist of two decoders, each responsible for one task (ASR or ST).
arXiv Detail & Related papers (2020-11-02T04:59:50Z)