Blockwise Streaming Transformer for Spoken Language Understanding and
Simultaneous Speech Translation
- URL: http://arxiv.org/abs/2204.08920v1
- Date: Tue, 19 Apr 2022 14:38:40 GMT
- Title: Blockwise Streaming Transformer for Spoken Language Understanding and
Simultaneous Speech Translation
- Authors: Keqi Deng, Shinji Watanabe, Jiatong Shi, Siddhant Arora
- Abstract summary: This paper takes the first step toward streaming spoken language understanding (SLU) and speech translation (ST) using a blockwise streaming Transformer.
We propose a cross-lingual encoding method, which employs a CTC branch optimized with target language translations.
Experimental results show that the blockwise streaming Transformer achieves competitive results compared to offline models.
- Score: 35.31787938396058
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although Transformers have achieved success in several speech
processing tasks such as spoken language understanding (SLU) and speech
translation (ST), achieving online processing while maintaining competitive
performance remains essential for real-world interaction. In this paper, we
take the first step toward streaming SLU and simultaneous ST using a blockwise
streaming Transformer, which is based on contextual block processing and
blockwise synchronous beam search. Furthermore,
we design an automatic speech recognition (ASR)-based intermediate loss
regularization for the streaming SLU task to improve the classification
performance further. As for the simultaneous ST task, we propose a
cross-lingual encoding method, which employs a CTC branch optimized with target
language translations. In addition, the CTC translation output is also used to
refine the search space with the CTC prefix score, achieving joint
CTC/attention
simultaneous translation for the first time. Experiments for SLU are conducted
on FSC and SLURP corpora, while the ST task is evaluated on Fisher-CallHome
Spanish and MuST-C En-De corpora. Experimental results show that the blockwise
streaming Transformer achieves competitive results compared to offline models,
especially with our proposed methods that further yield a 2.4% accuracy gain on
the SLU task and a 4.3 BLEU gain on the ST task over streaming baselines.
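
The abstract compresses several mechanisms into a few clauses, so a toy sketch
may help. The PyTorch code below is a minimal, illustrative reading of
contextual block processing with an auxiliary CTC head; the class name, shapes,
hyperparameters, and the single carried context vector are assumptions for
exposition, not the authors' implementation (the published contextual block
processing scheme carries context embeddings through every layer and sits
behind a speech front-end).

```python
import torch
import torch.nn as nn

class BlockwiseStreamingEncoder(nn.Module):
    """Toy contextual block processing: encode fixed-size blocks while a
    learned context vector carries information across block boundaries."""

    def __init__(self, d_model=256, nhead=4, num_layers=6,
                 block_size=40, vocab_size=500):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.block_size = block_size
        # Initial context embedding; its encoded value summarizes each block
        # and is handed to the next one (illustrative simplification).
        self.init_ctx = nn.Parameter(torch.zeros(1, 1, d_model))
        # Auxiliary CTC head. Attached to the final layer here for brevity;
        # an intermediate loss would tap a middle layer instead.
        self.ctc_head = nn.Linear(d_model, vocab_size)

    def forward(self, feats):  # feats: (batch, time, d_model)
        ctx = self.init_ctx.expand(feats.size(0), -1, -1)
        outs = []
        for start in range(0, feats.size(1), self.block_size):
            block = feats[:, start:start + self.block_size]
            # Attention within the block also sees the context vector, a
            # compressed summary of all previously processed blocks.
            h = self.encoder(torch.cat([ctx, block], dim=1))
            ctx, out = h[:, :1], h[:, 1:]
            outs.append(out)
        enc = torch.cat(outs, dim=1)  # (batch, time, d_model)
        return enc, self.ctc_head(enc).log_softmax(dim=-1)
```

With ASR transcripts as CTC targets, the auxiliary head mirrors the
intermediate-loss regularization for SLU; with target-language translations as
targets, it plays the role of the cross-lingual CTC branch. For example,
`enc, ctc_logp = BlockwiseStreamingEncoder()(torch.randn(2, 120, 256))` encodes
120 frames in three blocks of 40.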
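
Similarly, the joint CTC/attention simultaneous decoding amounts to
interpolating two log-probabilities whenever a hypothesis is extended during
blockwise synchronous beam search. In the sketch below, `att_model.next_logp`
and `ctc_scorer.prefix_logp` are hypothetical interfaces standing in for the
attention decoder and a CTC prefix scorer, and `lam` is an assumed
interpolation weight; both scorers are assumed to see only the frames already
emitted by the streaming encoder.

```python
def beam_step(hyps, att_model, ctc_scorer, beam_size=5, lam=0.3):
    """Expand each hypothesis once, scoring candidates jointly.

    hyps: list of (prefix, score) pairs, where prefix is a token-id list.
    att_model.next_logp(prefix) -> {token: logp} from the attention decoder.
    ctc_scorer.prefix_logp(prefix, token) -> log probability that the CTC
    branch emits prefix + [token] as a prefix of the frames seen so far.
    Both are assumed, simplified interfaces.
    """
    cands = []
    for prefix, score in hyps:
        for tok, att_lp in att_model.next_logp(prefix).items():
            ctc_lp = ctc_scorer.prefix_logp(prefix, tok)
            # The CTC prefix score prunes hypotheses that the
            # (translation-trained) CTC branch finds implausible so far.
            cands.append((prefix + [tok],
                          score + (1 - lam) * att_lp + lam * ctc_lp))
    # Keep the best few; in blockwise synchronous search this repeats until
    # hypotheses ask for frames the current block has not yet delivered.
    return sorted(cands, key=lambda c: -c[1])[:beam_size]
```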
Related papers
- FASST: Fast LLM-based Simultaneous Speech Translation [9.65638081954595]
Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly.
We propose FASST, a fast LLM-based method for streaming speech translation.
Experiment results show that FASST achieves the best quality-latency trade-off.
arXiv Detail & Related papers (2024-08-18T10:12:39Z)
- Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation [14.410024368174872]
This paper presents the LS-Transducer-SST, a label-synchronous neural transducer for simultaneous speech translation (SST).
The LS-Transducer-SST dynamically decides when to emit translation tokens based on an Auto-regressive Integrate-and-Fire mechanism (a toy sketch of this emission rule appears after this list).
Experiments on the Fisher-CallHome Spanish (Es-En) and MuST-C En-De data show that the LS-Transducer-SST gives a better quality-latency trade-off than existing popular methods.
arXiv Detail & Related papers (2024-06-06T22:39:43Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming ST and speaker diarization (SD) solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vectors.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
- Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder.
Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
arXiv Detail & Related papers (2023-07-07T02:26:18Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection to improve the performance of Transformer-Transducer (T-T), a streaming model commonly used in industry.
We first propose a strategy to generate code-switching text data and then investigate injecting the generated text into the T-T model explicitly via Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches for injecting generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost the performance even when predicting spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- UniST: Unified End-to-end Model for Streaming and Non-streaming Speech Translation [12.63410397982031]
We develop a unified model (UniST) which supports streaming and non-streaming speech translation.
Experiments on the most popular speech-to-text translation benchmark dataset, MuST-C, show that UniST achieves significant improvement for non-streaming ST.
arXiv Detail & Related papers (2021-09-15T15:22:10Z)
- Worse WER, but Better BLEU? Leveraging Word Embedding as Intermediate in Multitask End-to-End Speech Translation [127.54315184545796]
Speech translation (ST) aims to learn transformations from speech in the source language to text in the target language.
We propose to improve the multitask ST model by utilizing word embeddings as an intermediate representation.
arXiv Detail & Related papers (2020-05-21T14:22:35Z)
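
As an aside on the LS-Transducer-SST entry above, its Auto-regressive
Integrate-and-Fire mechanism can be reduced to a simple emission rule:
accumulate predicted per-frame weights and emit a translation token whenever
the running total crosses a threshold. The sketch below is a toy illustration;
the function name, the unit threshold, and the carry-over of the remainder are
assumptions in the spirit of integrate-and-fire models, not that paper's exact
formulation.

```python
def aif_emission_points(alphas, threshold=1.0):
    """Return the frame indices at which a translation token would be emitted.

    alphas: iterable of non-negative per-frame weights predicted by the model.
    """
    fired, acc = [], 0.0
    for t, a in enumerate(alphas):
        acc += a
        if acc >= threshold:   # integrate ... and fire
            fired.append(t)
            acc -= threshold   # carry the remainder toward the next token
    return fired

# e.g. aif_emission_points([0.2, 0.5, 0.6, 0.1, 0.9]) fires at frames 2 and 4.
```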
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.