StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection
- URL: http://arxiv.org/abs/2406.06097v1
- Date: Mon, 10 Jun 2024 08:27:58 GMT
- Title: StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection
- Authors: Sara Papi, Marco Gaido, Matteo Negri, Luisa Bentivogli
- Abstract summary: Streaming speech-to-text translation (StreamST) is the task of automatically translating speech while incrementally receiving an audio stream.
We introduce StreamAtt, the first StreamST policy, and propose StreamLAAL, the first StreamST latency metric.
- Score: 23.75894159181602
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Streaming speech-to-text translation (StreamST) is the task of automatically translating speech while incrementally receiving an audio stream. Unlike simultaneous ST (SimulST), which deals with pre-segmented speech, StreamST faces the challenges of handling continuous and unbounded audio streams. This requires additional decisions about what to retain of the previous history, which is impractical to keep entirely due to latency and computational constraints. Despite the real-world demand for real-time ST, research on streaming translation remains limited, with existing works solely focusing on SimulST. To fill this gap, we introduce StreamAtt, the first StreamST policy, and propose StreamLAAL, the first StreamST latency metric designed to be comparable with existing metrics for SimulST. Extensive experiments across all 8 languages of MuST-C v1.0 show the effectiveness of StreamAtt compared to a naive streaming baseline and the related state-of-the-art SimulST policy, providing a first step in StreamST research.
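As a rough illustration of the attention-based audio history selection idea described in the abstract, here is a minimal sketch, not the authors' implementation: all names (`audio_features`, `cross_attention`, `max_history_frames`) and the mean-threshold heuristic are assumptions. The idea is to use the cross-attention that already-emitted target tokens placed on the audio frames to decide which portion of the unbounded audio history can be discarded before the next streaming step.

```python
import torch

def select_audio_history(audio_features: torch.Tensor,
                         cross_attention: torch.Tensor,
                         max_history_frames: int) -> torch.Tensor:
    """Hypothetical sketch of attention-based audio history selection.

    audio_features:  (T, d) encoder states for the audio seen so far.
    cross_attention: (n_tokens, T) attention weights that already-emitted
                     target tokens assigned to each audio frame.
    Returns the trimmed audio history to keep for the next streaming step.
    """
    # Aggregate how much attention each audio frame received overall.
    frame_relevance = cross_attention.sum(dim=0)  # shape (T,)

    # Treat frames before the earliest still-attended frame as "consumed":
    # they can be dropped without affecting the upcoming predictions.
    attended = (frame_relevance > frame_relevance.mean()).nonzero()
    first_useful = int(attended.min()) if attended.numel() > 0 else 0

    # Never keep more history than the latency/compute budget allows.
    first_useful = max(first_useful, audio_features.size(0) - max_history_frames)
    return audio_features[first_useful:]
```

In the actual StreamAtt policy the retention decision is driven by the model's attention as the paper describes; the thresholding rule above is only one plausible heuristic used to make the sketch concrete.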
Related papers
- SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer [68.78023656892319]
This paper presents a dual-stream text-to-speech (TTS) model, SyncSpeech, capable of receiving streaming text input from upstream models while simultaneously generating streaming speech.
SyncSpeech has the following advantages: low latency, as it begins generating streaming speech upon receiving the second text token; and high efficiency, as it decodes all speech tokens corresponding to each arriving text token in one step.
arXiv Detail & Related papers (2025-02-16T12:14:17Z)
- FASST: Fast LLM-based Simultaneous Speech Translation [9.65638081954595]
Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly.
We propose FASST, a fast large language model based method for streaming speech translation.
Experiment results show that FASST achieves the best quality-latency trade-off.
arXiv Detail & Related papers (2024-08-18T10:12:39Z)
- CMU's IWSLT 2024 Simultaneous Speech Translation System [80.15755988907506]
This paper describes CMU's submission to the IWSLT 2024 Simultaneous Speech Translation (SST) task for translating English speech to German text in a streaming manner.
Our end-to-end speech-to-text (ST) system integrates the WavLM speech encoder, a modality adapter, and the Llama2-7B-Base model as the decoder.
arXiv Detail & Related papers (2024-08-14T10:44:51Z)
- StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning [48.84039953531356]
StreamSpeech is a direct Simul-S2ST model that jointly learns translation and simultaneous policy.
Experiments on CVSS benchmark demonstrate that StreamSpeech achieves state-of-the-art performance in both offline S2ST and Simul-S2ST tasks.
arXiv Detail & Related papers (2024-06-05T08:24:22Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming ST and SD solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
- AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation [55.1650189699753]
Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date.
Current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech.
We present AV-TranSpeech, the first audio-visual speech-to-speech model without relying on intermediate text.
arXiv Detail & Related papers (2023-05-24T17:59:03Z)
- Adapting Offline Speech Translation Models for Streaming with Future-Aware Distillation and Inference [34.50987690518264]
A popular approach to streaming speech translation is to employ a single offline model with a wait-k policy to support different latency requirements (a minimal wait-k sketch is included after this list).
There is a mismatch problem in using a model trained with complete utterances for streaming inference with partial input.
We propose a new approach called Future-Aware Streaming Translation (FAST) that adapts an offline ST model for streaming input.
arXiv Detail & Related papers (2023-03-14T13:56:36Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- UniST: Unified End-to-end Model for Streaming and Non-streaming Speech Translation [12.63410397982031]
We develop a unified model (UniST) which supports streaming and non-streaming speech translation.
Experiments on the most popular speech-to-text translation benchmark dataset, MuST-C, show that UniST achieves significant improvement for non-streaming ST.
arXiv Detail & Related papers (2021-09-15T15:22:10Z)
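For context on the wait-k policy mentioned in the FAST entry above, below is a minimal, framework-agnostic sketch; the `read_chunk` and `translate_step` callables are assumptions, not an API from any of the listed papers. The decoder first reads k source chunks, then alternates between emitting one target token and reading one more chunk until translation finishes.

```python
from typing import Callable, Iterator, List, Optional

def wait_k_decode(read_chunk: Callable[[], Optional[object]],
                  translate_step: Callable[[List[object], List[str]], str],
                  k: int) -> Iterator[str]:
    """Minimal wait-k simultaneous decoding loop (illustrative only).

    read_chunk():  returns the next source chunk, or None when the stream ends.
    translate_step(source, target): predicts the next target token given the
        source chunks read so far and the target tokens emitted so far.
    """
    source: List[object] = []
    target: List[str] = []

    # READ phase: wait for the first k source chunks before emitting anything.
    while len(source) < k:
        chunk = read_chunk()
        if chunk is None:
            break
        source.append(chunk)

    # Alternate WRITE (emit one token) and READ (consume one more chunk).
    stream_open = True
    while True:
        token = translate_step(source, target)
        if token == "</s>":
            return
        target.append(token)
        yield token
        if stream_open:
            chunk = read_chunk()
            if chunk is None:
                stream_open = False
            else:
                source.append(chunk)
```

An offline-trained model plugged into such a loop sees only partial input at each `translate_step` call, which is the train-test mismatch that FAST addresses.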