Adapting Offline Speech Translation Models for Streaming with
Future-Aware Distillation and Inference
- URL: http://arxiv.org/abs/2303.07914v2
- Date: Thu, 26 Oct 2023 11:27:24 GMT
- Title: Adapting Offline Speech Translation Models for Streaming with
Future-Aware Distillation and Inference
- Authors: Biao Fu, Minpeng Liao, Kai Fan, Zhongqiang Huang, Boxing Chen, Yidong
Chen, Xiaodong Shi
- Abstract summary: A popular approach to streaming speech translation is to employ a single offline model with a wait-k policy to support different latency requirements.
There is a mismatch problem in using a model trained with complete utterances for streaming inference with partial input.
We propose a new approach called Future-Aware Streaming Translation (FAST) that adapts an offline ST model for streaming input.
- Score: 34.50987690518264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A popular approach to streaming speech translation is to employ a single
offline model with a wait-k policy to support different latency requirements,
which is simpler than training multiple online models with different latency
constraints. However, there is a mismatch problem in using a model trained with
complete utterances for streaming inference with partial input. We demonstrate
that speech representations extracted at the end of a streaming input are
significantly different from those extracted from a complete utterance. To
address this issue, we propose a new approach called Future-Aware Streaming
Translation (FAST) that adapts an offline ST model for streaming input. FAST
includes a Future-Aware Inference (FAI) strategy that incorporates future
context through a trainable masked embedding, and a Future-Aware Distillation
(FAD) framework that transfers future context from an approximation of full
speech to streaming input. Our experiments on the MuST-C En→De, En→Es, and En→Fr
benchmarks show that FAST achieves better trade-offs between translation
quality and latency than strong baselines. Extensive analyses suggest that our
methods effectively alleviate the aforementioned mismatch problem between
offline training and online inference.
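
To make the wait-k policy and the FAI idea concrete, below is a minimal PyTorch sketch of streaming inference that appends a trainable masked embedding to the partial input as a placeholder for future context. Everything here (TinyST, the GRU encoder/decoder, D_MODEL, N_FUTURE, the chunking) is a hypothetical stand-in for illustration, not the authors' actual architecture.

```python
# Minimal sketch of wait-k streaming inference with an FAI-style
# trainable "future" embedding appended to the partial input.
# TinyST, the GRU layers, and all sizes below are hypothetical
# stand-ins, not the paper's actual architecture.
import torch
import torch.nn as nn

D_MODEL = 64    # feature/hidden size (assumed)
N_FUTURE = 4    # number of mask embeddings standing in for future frames (assumed)

class TinyST(nn.Module):
    def __init__(self, vocab_size=100):
        super().__init__()
        # Trainable placeholder for unseen future context (the FAI idea).
        self.future_emb = nn.Parameter(torch.zeros(N_FUTURE, D_MODEL))
        self.encoder = nn.GRU(D_MODEL, D_MODEL, batch_first=True)
        self.decoder = nn.GRU(D_MODEL, D_MODEL, batch_first=True)
        self.tgt_emb = nn.Embedding(vocab_size, D_MODEL)
        self.out = nn.Linear(D_MODEL, vocab_size)

    def encode_prefix(self, prefix):
        # Append the mask embeddings so the encoder sees a soft stand-in
        # for future frames instead of an abrupt utterance boundary.
        fut = self.future_emb.unsqueeze(0).expand(prefix.size(0), -1, -1)
        enc_out, _ = self.encoder(torch.cat([prefix, fut], dim=1))
        return enc_out

def wait_k_decode(model, chunks, k=3, max_len=20, bos=1, eos=2):
    """wait-k policy: stay k source chunks ahead, then alternate READ/WRITE."""
    tokens, read = [bos], 0
    with torch.no_grad():
        while len(tokens) - 1 < max_len:
            written = len(tokens) - 1
            while read < len(chunks) and read < written + k:
                read += 1                                  # READ one chunk
            prefix = torch.cat(chunks[:read], dim=1)       # partial input so far
            enc = model.encode_prefix(prefix)
            h0 = enc[:, -1:, :].transpose(0, 1).contiguous()  # crude summary state
            dec_in = model.tgt_emb(torch.tensor([tokens]))
            dec_out, _ = model.decoder(dec_in, h0)
            next_tok = model.out(dec_out[:, -1]).argmax(-1).item()  # WRITE one token
            tokens.append(next_tok)
            if next_tok == eos:
                break
    return tokens

model = TinyST()
chunks = [torch.randn(1, 5, D_MODEL) for _ in range(6)]  # six pseudo speech chunks
print(wait_k_decode(model, chunks, k=3))  # untrained, so tokens are arbitrary
```

With an untrained model the output tokens are meaningless; the point is the control flow: read until k chunks ahead, encode the prefix plus the future embeddings, then write one token per read.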
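The FAD framework can be pictured as representation-level distillation: a teacher that sees (an approximation of) the full utterance provides targets for a student that sees only the streaming prefix. The sketch below assumes an L2 match between encoder states and a frozen teacher; the paper's exact loss and teacher construction may differ.

```python
# Minimal sketch of a FAD-style distillation objective. The GRU encoders,
# feature sizes, and the MSE loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 64
teacher = nn.GRU(D, D, batch_first=True)   # offline model: encodes the full utterance
student = nn.GRU(D, D, batch_first=True)   # streaming model: encodes the prefix only

full_utt = torch.randn(1, 30, D)           # 30 frames of pseudo speech features
prefix_len = 10

with torch.no_grad():                      # teacher only provides targets
    t_enc, _ = teacher(full_utt)
s_enc, _ = student(full_utt[:, :prefix_len])

# Pull the student's prefix representations toward the teacher's
# representations of the same positions, which were computed with
# access to future context.
loss = F.mse_loss(s_enc, t_enc[:, :prefix_len])
loss.backward()
print(float(loss))
```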
Related papers
- FASST: Fast LLM-based Simultaneous Speech Translation [9.65638081954595]
Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly.
We propose FASST, a fast, LLM-based method for streaming speech translation.
Experimental results show that FASST achieves the best quality-latency trade-off.
arXiv Detail & Related papers (2024-08-18T10:12:39Z)
- Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming ST and speaker diarization (SD) solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vectors.
Our system achieves strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
- Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection [3.884530687475798]
A streaming BERT-based sequence tagging model is capable of detecting disfluencies in real time.
The model attains state-of-the-art latency and stability scores compared with recent work on incremental disfluency detection.
arXiv Detail & Related papers (2022-05-02T02:13:24Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Real-time Object Detection for Streaming Perception [84.2559631820007]
Streaming perception is proposed to jointly evaluate latency and accuracy with a single metric for online video perception.
We build a simple and effective framework for streaming perception.
Our method achieves competitive performance on Argoverse-HD dataset and improves the AP by 4.9% compared to the strong baseline.
arXiv Detail & Related papers (2022-03-23T11:33:27Z)
- UniST: Unified End-to-end Model for Streaming and Non-streaming Speech Translation [12.63410397982031]
We develop a unified model (UniST) which supports streaming and non-streaming speech translation.
Experiments on the most popular speech-to-text translation benchmark dataset, MuST-C, show that UniST achieves significant improvement for non-streaming ST.
arXiv Detail & Related papers (2021-09-15T15:22:10Z)
- Multi-mode Transformer Transducer with Stochastic Future Context [53.005638503544866]
Multi-mode speech recognition models can process longer future context to achieve higher accuracy, and when the latency budget is not flexible, they can still achieve reliable accuracy.
We show that a Multi-mode ASR model rivals, if not surpasses, a set of competitive streaming baselines trained with different latency budgets.
arXiv Detail & Related papers (2021-06-17T18:42:11Z)
- Streaming Models for Joint Speech Recognition and Translation [11.657994715914748]
We develop an end-to-end streaming ST model based on a re-translation approach and compare against standard cascading approaches.
We also introduce a novel inference method for the joint case, interleaving both transcript and translation in generation and removing the need to use separate decoders.
arXiv Detail & Related papers (2021-01-22T15:16:54Z)
- Streaming end-to-end multi-talker speech recognition [34.76106500736099]
We propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition.
Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone that can meet various latency constraints.
Experiments on the publicly available LibriSpeechMix dataset show that heuristic error assignment training (HEAT) achieves better accuracy than permutation invariant training (PIT).
arXiv Detail & Related papers (2020-11-26T06:28:04Z)