Large-Scale Streaming End-to-End Speech Translation with Neural
Transducers
- URL: http://arxiv.org/abs/2204.05352v1
- Date: Mon, 11 Apr 2022 18:18:53 GMT
- Title: Large-Scale Streaming End-to-End Speech Translation with Neural
Transducers
- Authors: Jian Xue, Peidong Wang, Jinyu Li, Matt Post, Yashesh Gaur
- Abstract summary: We introduce a streaming end-to-end speech translation (ST) model to convert audio signals to texts in other languages directly.
Compared with cascaded ST that performs ASR followed by text-based machine translation (MT), the proposed Transformer transducer (TT)-based ST model drastically reduces inference latency.
We extend TT-based ST to multilingual ST, which generates texts of multiple languages at the same time.
- Score: 35.2855796745394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural transducers have been widely used in automatic speech recognition
(ASR). In this paper, we introduce them to streaming end-to-end speech
translation (ST), which aims to convert audio signals to texts in other
languages directly. Compared with cascaded ST that performs ASR followed by
text-based machine translation (MT), the proposed Transformer transducer
(TT)-based ST model drastically reduces inference latency, exploits speech
information, and avoids error propagation from ASR to MT. To improve the
modeling capacity, we propose attention pooling for the joint network in TT. In
addition, we extend TT-based ST to multilingual ST, which generates texts of
multiple languages at the same time. Experimental results on a large-scale
pseudo-labeled training set of 50 thousand (K) hours show that TT-based ST not only
significantly reduces inference time but also outperforms non-streaming
cascaded ST for English-German translation.
Related papers
- FASST: Fast LLM-based Simultaneous Speech Translation [9.65638081954595]
Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly.
We propose FASST, a fast large language model based method for streaming speech translation.
Experimental results show that FASST achieves the best quality-latency trade-off.
arXiv Detail & Related papers (2024-08-18T10:12:39Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming ST and SD solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
- Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder.
Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
arXiv Detail & Related papers (2023-07-07T02:26:18Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Back Translation for Speech-to-text Translation Without Transcripts [11.13240570688547]
We develop a back translation algorithm for ST (BT4ST) to synthesize pseudo ST data from monolingual target data.
To ease the challenges posed by short-to-long generation and one-to-many mapping, we introduce self-supervised discrete units.
With our synthetic ST data, we achieve an average boost of 2.3 BLEU on MuST-C En-De, En-Fr, and En-Es datasets.
arXiv Detail & Related papers (2023-05-15T15:12:40Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of a streaming model commonly used in industry, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Tackling data scarcity in speech translation using zero-shot multilingual machine translation techniques [12.968557512440759]
Several techniques have been proposed for zero-shot translation.
We investigate whether these ideas can be applied to speech translation, by building ST models trained on speech transcription and text translation data.
The techniques were successfully applied to few-shot ST using limited ST data, with improvements of up to +12.9 BLEU points compared to direct end-to-end ST and +3.1 BLEU points compared to ST models fine-tuned from an ASR model.
arXiv Detail & Related papers (2022-01-26T20:20:59Z)
- Incremental Speech Synthesis For Speech-To-Speech Translation [23.951060578077445]
We focus on improving the incremental synthesis performance of TTS models.
With a simple data augmentation strategy based on prefixes, we are able to improve the incremental TTS quality to approach offline performance.
We propose latency metrics tailored to S2ST applications, and investigate methods for latency reduction in this context.
arXiv Detail & Related papers (2021-10-15T17:20:28Z)
- Zero-shot Speech Translation [0.0]
Speech Translation (ST) is the task of translating speech in one language into text in another language.
End-to-end approaches use only one system to avoid propagating error, yet are difficult to employ due to data scarcity.
We explore zero-shot translation, which enables translating a pair of languages that is unseen during training.
arXiv Detail & Related papers (2021-07-13T12:00:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.