Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments
- URL: http://arxiv.org/abs/2307.03354v2
- Date: Mon, 2 Oct 2023 08:59:09 GMT
- Title: Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments
- Authors: Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Jinyu Li, Yashesh Gaur
- Abstract summary: This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder.
Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
- Score: 49.38965743465124
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In real-world applications, users often require both translations and
transcriptions of speech to enhance their comprehension, particularly in
streaming scenarios where incremental generation is necessary. This paper
introduces a streaming Transformer-Transducer that jointly generates automatic
speech recognition (ASR) and speech translation (ST) outputs using a single
decoder. To produce ASR and ST content effectively with minimal latency, we
propose a joint token-level serialized output training method that interleaves
source and target words by leveraging an off-the-shelf textual aligner.
Experiments in monolingual (it-en) and multilingual ({de,es,it}-en) settings
demonstrate that our approach achieves the best quality-latency balance. With
an average ASR latency of 1s and ST latency of 1.3s, our model shows no
degradation or even improves output quality compared to separate ASR and ST
models, yielding an average improvement of 1.1 WER and 0.4 BLEU in the
multilingual case.
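The interleaving idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tag names `<asr>` and `<st>`, the function names, and the rule "emit a target word once every source word it is aligned to has been emitted" are assumptions made for illustration, and the alignment pairs are written by hand here, whereas the paper obtains them from an off-the-shelf textual aligner.

```python
from typing import Dict, List, Tuple

# Hypothetical stream markers; the paper's actual special tokens may differ.
ASR_TAG = "<asr>"
ST_TAG = "<st>"


def interleave(src: List[str], tgt: List[str],
               align: List[Tuple[int, int]]) -> List[str]:
    """Serialize source and target words into one token sequence.

    `align` holds (source_index, target_index) pairs. A target word is
    emitted right after the last source word it depends on, and target
    order is kept monotonic (an assumption of this sketch).
    """
    # Last source position each aligned target word depends on.
    last_src: Dict[int, int] = {}
    for s, t in align:
        last_src[t] = max(last_src.get(t, -1), s)

    # Unaligned target words inherit the position of the previous target word.
    ready_after: List[int] = []
    prev = -1
    for t in range(len(tgt)):
        prev = max(prev, last_src.get(t, prev))
        ready_after.append(prev)

    out: List[str] = []
    current = None

    def emit(tag: str, word: str) -> None:
        nonlocal current
        if current != tag:  # insert a marker only when the stream switches
            out.append(tag)
            current = tag
        out.append(word)

    t = 0
    for s, word in enumerate(src):
        emit(ASR_TAG, word)
        while t < len(tgt) and ready_after[t] <= s:
            emit(ST_TAG, tgt[t])
            t += 1
    while t < len(tgt):  # only reached when src is empty
        emit(ST_TAG, tgt[t])
        t += 1
    return out


def split_streams(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Recover the ASR and ST word sequences from a serialized sequence."""
    src: List[str] = []
    tgt: List[str] = []
    current = None
    for tok in tokens:
        if tok in (ASR_TAG, ST_TAG):
            current = tok
        elif current == ST_TAG:
            tgt.append(tok)
        else:  # words seen before any marker are treated as ASR output
            src.append(tok)
    return src, tgt


if __name__ == "__main__":
    source = ["il", "gatto", "nero", "dorme"]
    target = ["the", "black", "cat", "sleeps"]
    pairs = [(0, 0), (1, 2), (2, 1), (3, 3)]  # hand-written alignment
    print(" ".join(interleave(source, target, pairs)))
```

For this toy it-en pair the script prints `<asr> il <st> the <asr> gatto nero <st> black cat <asr> dorme <st> sleeps`: "black" and "cat" wait until both "gatto" and "nero" are out because the alignment crosses. A single decoder trained on sequences of this shape can emit both outputs incrementally, and `split_streams` separates them again for scoring.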
Related papers
- FASST: Fast LLM-based Simultaneous Speech Translation [9.65638081954595]
Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly.
We propose FASST, a fast large language model based method for streaming speech translation.
Experiment results show that FASST achieves the best quality-latency trade-off.
arXiv Detail & Related papers (2024-08-18T10:12:39Z)
- DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation [29.76274107159478]
Non-autoregressive Transformers (NATs) are applied in direct speech-to-speech translation systems.
We introduce DiffNorm, a diffusion-based normalization strategy that simplifies data distributions for training NAT models.
Our strategies result in a notable improvement of about +7 ASR-BLEU for English-Spanish (En-Es) and +2 ASR-BLEU for English-French (En-Fr) on the CVSS benchmark.
arXiv Detail & Related papers (2024-05-22T01:10:39Z)
- Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation [51.399695200838586]
We propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder.
Experiments on it,es,de->en prove the effectiveness of our approach, enabling the generation of one-to-many joint outputs with a single decoder for the first time.
arXiv Detail & Related papers (2023-10-23T11:00:27Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of Transformer-Transducer (T-T), a streaming model commonly used in industry.
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims at translating the source language speech into target language text without generating the intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
arXiv Detail & Related papers (2022-10-18T03:06:47Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
- Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR [21.622039537743607]
Simultaneous speech-to-text translation is widely useful in many scenarios.
Recent efforts attempt to directly translate the source speech into target text simultaneously, but this is much harder due to the combination of two separate tasks.
We propose a new paradigm with the advantages of both cascaded and end-to-end approaches.
arXiv Detail & Related papers (2021-06-11T23:22:37Z)
- RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer [33.876412404781846]
RealTranS is an end-to-end model for simultaneous speech translation.
It maps speech features into text space with a weighted-shrinking operation and a semantic encoder.
Experiments show that RealTranS with the Wait-K-Stride-N strategy outperforms prior end-to-end models.
arXiv Detail & Related papers (2021-06-09T06:35:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.