UniST: Unified End-to-end Model for Streaming and Non-streaming Speech
Translation
- URL: http://arxiv.org/abs/2109.07368v1
- Date: Wed, 15 Sep 2021 15:22:10 GMT
- Authors: Qianqian Dong, Yaoming Zhu, Mingxuan Wang, Lei Li
- Abstract summary: We develop a unified model (UniST) which supports streaming and non-streaming speech translation.
Experiments on the most popular speech-to-text translation benchmark dataset, MuST-C, show that UniST achieves significant improvement for non-streaming ST.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a unified end-to-end framework for both streaming
and non-streaming speech translation. While training recipes for non-streaming
speech translation are mature, recipes for streaming speech translation have yet
to be established. In this work, we focus on developing a unified model (UniST)
that supports streaming and non-streaming ST from the perspective of fundamental
components, including the training objective, attention mechanism, and decoding
policy. Experiments on MuST-C, the most popular speech-to-text translation
benchmark dataset, show that UniST achieves significant improvements for
non-streaming ST and a better-learned trade-off between BLEU score and latency
metrics for streaming ST, compared with end-to-end baselines and cascaded
models. We will make our code and evaluation tools publicly available.
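The latency metrics traded off against BLEU in streaming ST are typically computed from the decoder's read/write schedule. As an illustration, here is a minimal sketch of one widely used metric, Average Lagging (AL); the function name and interface are assumptions for this example, not from the paper, and `delays[t]` is assumed to record how many source segments had been read when target token t was emitted:

```python
def average_lagging(delays, src_len, tgt_len):
    """Average Lagging for one hypothesis of a simultaneous translation system.

    delays  -- list where delays[t-1] is the number of source segments read
               before target token t was written (1-indexed tokens)
    src_len -- number of source segments
    tgt_len -- number of target tokens
    """
    gamma = tgt_len / src_len  # expected target tokens per source segment
    # tau: index of the first target token emitted after the full source
    # was read; only tokens up to tau contribute to the average
    tau = next((t for t, g in enumerate(delays, 1) if g >= src_len),
               len(delays))
    return sum(delays[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau
```

For example, a wait-1 schedule on equal-length sequences (`delays=[1, 2, 3]`, 3 source segments, 3 target tokens) lags one segment behind throughout, giving an AL of 1.0.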
Related papers
- FASST: Fast LLM-based Simultaneous Speech Translation [9.65638081954595]
Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly.
We propose FASST, a fast large language model based method for streaming speech translation.
Experiment results show that FASST achieves the best quality-latency trade-off.
arXiv Detail & Related papers (2024-08-18T10:12:39Z)
- CMU's IWSLT 2024 Simultaneous Speech Translation System [80.15755988907506]
This paper describes CMU's submission to the IWSLT 2024 Simultaneous Speech Translation (SST) task for translating English speech to German text in a streaming manner.
Our end-to-end speech-to-text (ST) system integrates the WavLM speech encoder, a modality adapter, and the Llama2-7B-Base model as the decoder.
arXiv Detail & Related papers (2024-08-14T10:44:51Z)
- StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection [23.75894159181602]
Streaming speech-to-text translation (StreamST) is the task of automatically translating speech while incrementally receiving an audio stream.
We introduce StreamAtt, the first StreamST policy, and propose StreamLAAL, the first StreamST latency metric.
arXiv Detail & Related papers (2024-06-10T08:27:58Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first joint streaming speech translation (ST) and speaker diarization (SD) solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vectors.
Our system achieves ST and SD capability competitive with Whisper-based offline systems, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
- Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder.
Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
arXiv Detail & Related papers (2023-07-07T02:26:18Z)
- Adapting Offline Speech Translation Models for Streaming with Future-Aware Distillation and Inference [34.50987690518264]
A popular approach to streaming speech translation is to employ a single offline model with a wait-k policy to support different latency requirements.
However, a mismatch arises when a model trained on complete utterances performs streaming inference on partial input.
We propose a new approach called Future-Aware Streaming Translation (FAST) that adapts an offline ST model for streaming input.
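The wait-k policy mentioned above first reads k source segments, then alternates one target write with one source read until the source is exhausted. A minimal sketch of the resulting read/write schedule (the function name and interface are illustrative assumptions, not from any of the papers listed here):

```python
def wait_k_schedule(num_source, num_target, k):
    """Return the (action, index) sequence a wait-k policy produces:
    read k source segments up front, then alternate WRITE and READ,
    writing freely once the source stream is exhausted."""
    actions = []
    read, written = 0, 0
    while written < num_target:
        # Read as long as we are behind the wait-k frontier and
        # source segments remain; otherwise write the next token.
        if read < min(k + written, num_source):
            actions.append(("READ", read))
            read += 1
        else:
            actions.append(("WRITE", written))
            written += 1
    return actions
```

For instance, wait-2 over 4 source segments and 4 target tokens yields the schedule READ, READ, WRITE, READ, WRITE, READ, WRITE, WRITE: each token is emitted two segments behind the input until the source runs out.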
arXiv Detail & Related papers (2023-03-14T13:56:36Z)
- M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation [66.92823764664206]
We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text.
While shrinking the speech sequence, M-Adapter produces features desired for speech-to-text translation.
Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU.
arXiv Detail & Related papers (2022-07-03T04:26:53Z)
- Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks.
In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z)
- Streaming Models for Joint Speech Recognition and Translation [11.657994715914748]
We develop an end-to-end streaming ST model based on a re-translation approach and compare against standard cascading approaches.
We also introduce a novel inference method for the joint case, interleaving both transcript and translation in generation and removing the need to use separate decoders.
arXiv Detail & Related papers (2021-01-22T15:16:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.